python regex for movie subtitles

I'm trying to make a simple regex that would recognize micro dvd format:
{52}{118}some text
{123}{202}some text
{203}{259}some text
{261}{309}some text
My code looks like the following. match_obj is None and I don't know why:
import re
my_re = r"\{([0-9]*)\}\{[0-9]\}(.*)"
f = open('abc.txt')
match_obj = re.match(my_re, f.readline())
I have tried also:
match_obj = re.match(my_re, f.readline(), re.M|re.I)
with the same results.

You're very close - you're just missing a repeat symbol in the second number section. Your regex should look like this:
my_re = r"\{([0-9]*)\}\{[0-9]*\}(.*)"
Notice the added asterisk after the second [] block.

\{([0-9]*)\}\{[0-9] \}(.*)
                  /|\
                   |
You're missing a repeater in your second number character class.
I'm not sure about the rules of movie subtitles, but I would assume the brackets can not be empty.
A stricter regex would then be (albeit, probably not needed in your case):
\{([0-9]+)\}\{[0-9]+\}(.*)
The + repeater means 1 or more. The * repeater means 0 or more.
Are you only interested in the first number?
Is the text meant to be optional?
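As a rough sketch of the difference, the snippet below runs both the original pattern and the stricter + variant on one sample line. Capturing the second number and the names broken, fixed, start, end and text are additions for illustration, not part of the original question.

```python
import re

# Original pattern: the second [0-9] matches exactly one digit.
broken = re.compile(r"\{([0-9]*)\}\{[0-9]\}(.*)")
# Stricter pattern; capturing the end frame too is an assumption for illustration.
fixed = re.compile(r"\{([0-9]+)\}\{([0-9]+)\}(.*)")

line = "{52}{118}some text\n"  # shaped like what f.readline() returns

print(broken.match(line))  # None: "118" has more than one digit

m = fixed.match(line)
start, end, text = int(m.group(1)), int(m.group(2)), m.group(3)
print(start, end, text)  # 52 118 some text
```

Because . does not match a newline by default, the trailing "\n" from readline() is left out of the text group.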

Related

get all the text between two newline characters(\n) of a raw_text using python regex

I have several examples of raw text from which I have to extract the characters after 'Terms'. The common pattern I see is that the word 'Terms' is followed by a '\n', and there is another '\n' at the end. I want to extract all the characters (words, numbers, symbols) between these two \n, but only after the keyword 'Terms'.
Some examples of text are given below:
1) \nTERMS \nDirect deposit; Routing #256078514, acct. #160935\n\n'
2) \nTerms\nDue on receipt\nDue Date\n1/31/2021
3) \nTERMS: \nNET 30 DAYS\n
The code I have written is given below:
def get_term_regex(s):
    raw_text = s
    term_regex1 = r'(TERMS\s*\\n(.*?)\\n)'
    try:
        if ('TERMS' or 'Terms') in raw_text:
            pattern1 = re.search(term_regex1,raw_text)
            #print(pattern1)
            return pattern1
    except:
        pass
But I am not getting any output, as there is no match.
The expected output is:
1) Direct deposit; Routing #256078514, acct. #160935
2) Due on receipt
3) NET 30 DAYS
Any help would be really appreciated.

Try the following:
import re
text = '''1) \nTERMS \nDirect deposit; Routing #256078514, acct. #160935\n\n'
2) \nTerms\nDue on receipt\nDue Date\n1/31/2021
3) \nTERMS: \nNET 30 DAYS\n''' # \n are real new lines
for m in re.finditer(r'(TERMS|Terms)\W*\n(.*?)\n', text):
    print(m.group(2))
Note that your regex could not deal with the third 'line' because there is a colon : after TERMS. So I replaced \s with \W.
('TERMS' or 'Terms') in raw_text might not be what you want. It does not raise a syntax error, but it is just the same as 'TERMS' in raw_text: when Python evaluates the parenthesized part, both 'TERMS' and 'Terms' are truthy, and or returns the first truthy operand, i.e., 'TERMS'. As a result, 'Terms' is never checked by that condition!
So you might instead want something like ('TERMS' in raw_text) or ('Terms' in raw_text), although it is quite verbose.
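A minimal sketch of both points: how or evaluates, and a helper that lets the regex alternation handle both spellings. The helper name get_terms and the samples list are hypothetical, and the samples assume the \n sequences are real newlines.

```python
import re

# `or` returns its first truthy operand, so only 'TERMS' is ever tested.
print('TERMS' or 'Terms')                                   # TERMS
print(('TERMS' or 'Terms') in '\nTerms\nDue on receipt\n')  # False

def get_terms(raw_text):
    """Return the line after a TERMS/Terms label, or None (hypothetical helper)."""
    m = re.search(r'(?:TERMS|Terms)\W*\n(.*?)\n', raw_text)
    return m.group(1) if m else None

samples = [
    '\nTERMS \nDirect deposit; Routing #256078514, acct. #160935\n\n',
    '\nTerms\nDue on receipt\nDue Date\n1/31/2021',
    '\nTERMS: \nNET 30 DAYS\n',
]
results = [get_terms(s) for s in samples]
print(results)
```

Since the pattern already lists both spellings, the separate in check becomes unnecessary; re.search simply returns None when neither is present.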

Regex search up to first instance Python

I know that there are a bunch of other similar questions to this, but I have built off other answers with no success.
I've dug here, here, here, here, and here
but this question is closest to what I'm trying to do, however it's in php and I'm using python3
My goal is to extract a substring from a body text.
The body is formatted:
**Header1**
thing1
thing2
thing3
thing4
**Header2**
dsfgs
sdgsg
rrrrrr
**Hello Dolly**
abider
abcder
ffffff
etc.
Formatting on SO is tough. But in the actual text, there's no spaces, just newlines for each line.
I want what's under Header2, so currently I have:
found = re.search("\*\*Header2\*\*\n[^*]+",body)
if found:
    list = found.group(0)
    list = list[11:]
    list = list.split('\n')
    print(list)
But that's returning "None". Various other regex I've tried also haven't worked, or grabbed too much (all of the remaining headers).
For what it's worth I've also tried:
\*\*Header2\*\*.+?^\**$
\*\*Header2\*\*[^*\s\S]+\*\* and about 10 other permutations of those.

Brief
Your pattern \*\*Header2\*\*\n[^*]+ isn't matching because your line **Header2** includes trailing spaces before the newline character. Adding  * (a space followed by an asterisk) before the \n should suffice, but I've added other options below as well.
Code
\*{2}Header2\*{2} *\n([^*]+)
Alternatively, you can also use the following regex (which also allows you to capture lines with * in them so long as they don't match the format of your header ^\*{2}[^*]*\*{2} - it also beautifully removes whitespace from the last element under the header - uses the im flags):
^\*{2}Header2\*{2} *\n((?:(?!^\*{2}[^*]*\*{2}).)*?)(?=\s*^\*{2}[^*]*\*{2}|\s*\Z)
Usage
import re
regex = r"\*{2}Header2\*{2}\s*([^*]+)\s*"
test_str = ("**Header1** \n"
"thing1 \n"
"thing2 \n"
"thing3 \n"
"thing4 \n\n"
"**Header2** \n"
"dsfgs \n"
"sdgsg \n"
"rrrrrr \n\n"
"**Hello Dolly** \n"
"abider \n"
"abcder \n"
"ffffff")
print(re.search(regex, test_str).group(1))
Explanation
The pattern is practically identical to the OP's original pattern. I made minor changes to allow it to better perform and also get the result the OP is expecting.
\*\* changed to \*{2}: Very minor adjustment for performance
\n changed to  *\n (a space and an asterisk before the newline): Takes additional spaces at the end of a line into account before the newline character
([^*]+): Captures the contents the OP is expecting into capture group 1
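To make the trailing-space diagnosis concrete, here is a small sketch. The body string is an assumed reconstruction of the OP's text with one trailing space per line, and items is a name chosen for illustration.

```python
import re

# Assumed sample: every line ends with a space before the newline.
body = ("**Header1** \nthing1 \n"
        "**Header2** \ndsfgs \nsdgsg \nrrrrrr \n\n"
        "**Hello Dolly** \nabider \n")

# The original pattern demands "\n" right after the header, so the space breaks it.
print(re.search(r"\*\*Header2\*\*\n[^*]+", body))  # None

# Allowing optional spaces before the newline lets the match succeed.
found = re.search(r"\*{2}Header2\*{2} *\n([^*]+)", body)
items = [line.strip() for line in found.group(1).splitlines() if line.strip()]
print(items)  # ['dsfgs', 'sdgsg', 'rrrrrr']
```

Taking group(1) instead of slicing off the header by hand (list[11:]) avoids depending on the header's length.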

You could use
^\*\*Header2\*\*.*[\n\r]
(?P<content>(?:.+[\n\r])+)
with the multiline and verbose modifier, see a demo on regex101.com.
Afterwards, just grab what is inside content (i.e. using re.finditer()).
Broken down this says:
^\*\*Header2\*\*.*[\n\r] # match **Header2** at the start of the line
# and newline characters
(?P<content>(?:.+[\n\r])+) # afterwards match as many non-empty lines as possible
In Python:
import re
rx = re.compile(r'''
^\*\*Header2\*\*.*[\n\r]
(?P<content>(?:.+[\n\r])+)
''', re.MULTILINE | re.VERBOSE)
for match in rx.finditer(your_string_here):
    print(match.group('content'))
I have the feeling that you even want to allow empty lines between paragraphs. If so, change the expression to
^\*\*Header2\*\*.*[\n\r]
(?P<content>[\s\S]+?)
(?=^\*\*)
See a demo for the latter on regex101.com as well.

You can try this:
import re
s = """
**Header1**
thing1
thing2
thing3
thing4
**Header2**
dsfgs
sdgsg
rrrrrr
**Hello Dolly**
abider
abcder
ffffff
"""
new_contents = re.findall(r'(?<=\*\*Header2\*\*)[\n\sa-zA-Z0-9]+', s)
Output:
[' \ndsfgs \nsdgsg \nrrrrrr \n\n']
If you want to remove the whitespace from the output, you can try this (in Python 3, filter returns an iterator, so wrap it in list):
final_data = list(filter(None, re.split(r'\s+', new_contents[0])))
Output:
['dsfgs', 'sdgsg', 'rrrrrr']

How to count sentences taking into account the occurrence of ellipses

I've written the following script to count the number of sentences in a text file:
import re
filepath = 'sample_text_with_ellipsis.txt'
with open(filepath, 'r') as f:
    read_data = f.read()
sentences = re.split(r'[.{1}!?]+', read_data.replace('\n',''))
sentences = sentences[:-1]
sentence_count = len(sentences)
However, if I run it on a sample_text_with_ellipsis.txt with the following content:
Wait for it... awesome!
I get sentence_count = 2 instead of 1, because it does not ignore the ellipsis (i.e., the "...").
What I tried to do in the regex is to make it match only one occurrence of a period through .{1}, but this apparently doesn't work the way I intended it. How can I get the regex to ignore ellipses?

Splitting sentences with a regex like this is not enough. See Python split text on sentences to see how NLTK can be leveraged for this.
To answer your question: you treat a sequence of three dots as an ellipsis, so a dot should only count when it stands alone. Thus, you need to use
[!?]+|(?<!\.)\.(?!\.)
The . is moved out of the character class because quantifiers have no special meaning inside one (there, {, 1 and } are literal characters), and a . is matched only when it is not adjacent to another dot.
[!?]+ - 1 or more ! or ?
| - or
(?<!\.)\.(?!\.) - a dot that is neither preceded ((?<!\.)), nor followed ((?!\.)) with a dot.
See Python demo:
import re
sentences = re.split(r'[!?]+|(?<!\.)\.(?!\.)', "Wait for it... awesome!".replace('\n',''))
sentences = sentences[:-1]
sentence_count = len(sentences)
print(sentence_count) # => 1
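A short sketch of why the original character class misbehaved, plus the lookaround pattern wrapped in a function. The name count_sentences and the second test sentence are made up for illustration, and filtering empty fragments (instead of dropping the last element) is a variation on the code above.

```python
import re

# Inside [...] the characters '{', '1' and '}' are literals, not a quantifier.
print(re.split(r'[.{1}!?]+', 'a{b}c1d'))  # ['a', 'b', 'c', 'd']

def count_sentences(text):
    """Count fragments split on !, ? or a lone dot (hypothetical helper)."""
    parts = re.split(r'[!?]+|(?<!\.)\.(?!\.)', text.replace('\n', ''))
    # Keep only fragments that contain something besides whitespace.
    return len([p for p in parts if p.strip()])

print(count_sentences("Wait for it... awesome!"))        # 1
print(count_sentences("One. Two... still two! Three?"))  # 3
```

Filtering out empty fragments also behaves sensibly when the text does not end with a terminator, where blindly dropping the last element would undercount.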

Following Wiktor's suggestion to use NLTK, I also came up with the following alternative solution:
import nltk
read_data="Wait for it... awesome!"
sentence_count = len(nltk.tokenize.sent_tokenize(read_data))
This yields a sentence count of 1 as expected.

Python regex: find lines where period is missing in

I'm looking for a regular expression, implemented in Python, that will match on this text
WHERE PolicyGUID = '531B2310-403A-13DA-5964-E2EFA56B0753'
but will not match on this text
WHERE AsPolicy.PolicyGUID = '531B2310-403A-13DA-5964-E2EFA56B0753'
I'm doing this to find places in a large piece of SQL where the developer did not explicitly reference the table name. All I want to do is print the offending lines (the first WHERE clause above). I have all of the code done except for the regex.

re.compile('''WHERE [^.]+ =''')
Here, the [] indicates "match a set of characters," the ^ means "not" and the dot is a literal period. The + means "one or more."
Was that what you were looking for?

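Something like
WHERE .*\..* = .*
Note that this matches the qualified form (it requires a period before the =), so the offending lines are the ones it does not match. I'm not sure how accurate it can be; it depends on what your data looks like. If you provide a bigger sample, it can be refined.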

Something like this would work in java, c#, javascript, I suppose you can adapt it to python:
/WHERE +[^\.]+ *\=/

>>> l
["WHERE PolicyGUID = '531B2310-403A-13DA-5964-E2EFA56B0753' ", "WHERE AsPolicy.PolicyGUID = '531B2310-403A-13DA-5964-E2EFA56B0753' "]
>>> [line for line in l if re.match('WHERE [^.]+ =', line)]
["WHERE PolicyGUID = '531B2310-403A-13DA-5964-E2EFA56B0753' "]
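Putting the idea together for a multi-line piece of SQL, a sketch might look like the following. The sql sample and the names unqualified and offending are hypothetical, and the character class is tightened to exclude whitespace and = as well, so that it covers a single identifier only; real SQL may need a proper parser.

```python
import re

# Hypothetical sample SQL for illustration.
sql = """SELECT * FROM AsPolicy
WHERE PolicyGUID = '531B2310-403A-13DA-5964-E2EFA56B0753'
SELECT * FROM AsPolicy
WHERE AsPolicy.PolicyGUID = '531B2310-403A-13DA-5964-E2EFA56B0753'
"""

# WHERE, then one identifier without a dot, then '='.
unqualified = re.compile(r"\bWHERE\s+[^.\s=]+\s*=")

offending = [line for line in sql.splitlines() if unqualified.search(line)]
for line in offending:
    print(line)
```

Using search rather than match means the WHERE does not have to sit at the very start of the line.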

python regex for repeating string

I am wanting to verify and then parse this string (in quotes):
string = "start: c12354, c3456, 34526; other stuff that I don't care about"
//Note that some codes begin with 'c'
I would like to verify that the string starts with 'start:' and ends with ';'
Afterward, I would like to have a regex parse out the strings. I tried the following python re code:
regx = r"start: (c?[0-9]+,?)+;"
reg = re.compile(regx)
matched = reg.search(string)
print ' matched.groups()', matched.groups()
I have tried different variations but I can either get the first or the last code but not a list of all three.
Or should I abandon using a regex?
EDIT: updated to reflect part of the problem space I neglected and fixed string difference.
Thanks for all the suggestions - in such a short time.

In Python, this isn’t possible with a single regular expression: each capture of a group overrides the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).
Your easiest solution is to first extract the part between start: and ;, and then use a regular expression that returns all matches, not just a single match, with re.findall('c?[0-9]+', text).
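A small sketch of both observations. The optional space inside the repeated group is a tweak to the OP's pattern so that it matches the sample at all; outer and codes are names chosen for illustration.

```python
import re

string = "start: c12354, c3456, 34526; other stuff that I don't care about"

# A repeated group keeps only its final capture.
m = re.search(r"start: (c?[0-9]+,? ?)+;", string)
print(m.groups())  # ('34526',)

# Two steps: isolate the list between "start:" and ";", then collect every code.
outer = re.match(r"start:([^;]*);", string)
codes = re.findall(r"c?[0-9]+", outer.group(1)) if outer else []
print(codes)  # ['c12354', 'c3456', '34526']
```

The re.match call doubles as the validation step: it returns None unless the string starts with start: and contains a ;.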

You could use the standard string tools, which are pretty much always more readable.
s = "start: c12354, c3456, 34526;"
s.startswith("start:") # True if the string starts with "start:"
s.endswith(";") # True if the string ends with ";"
s[6:-1].strip().split(', ') # a list of the tokens separated by ", " (strip removes the space after "start:")

This can be done (pretty elegantly) with a tool like Pyparsing:
from pyparsing import Group, Literal, OneOrMore, Optional, ParseException, Word
import string
code = Group(Optional(Literal("c"), default='') + Word(string.digits) + Optional(Literal(","), default=''))
parser = Literal("start:") + OneOrMore(code) + Literal(";")
# Read lines from file:
with open('lines.txt', 'r') as f:
    for line in f:
        try:
            result = parser.parseString(line)
            codes = [c[1] for c in result[1:-1]]
            # Do something with teh codez...
        except ParseException:
            # Oh noes: string doesn't match!
            continue
Cleaner than a regular expression, returns a list of codes (no need to string.split), and ignores any extra characters in the line, just like your example.

import re
sstr = re.compile(r'start:([^;]*);')
slst = re.compile(r'(?:c?)(\d+)')
mystr = "start: c12354, c3456, 34526; other stuff that I don't care about"
match = re.match(sstr, mystr)
if match:
    res = re.findall(slst, match.group(1))
results in
['12354', '3456', '34526']
