I know that there are a bunch of other similar questions to this, but I have built off other answers with no success.
I've dug through several of them. One question is closest to what I'm trying to do, but it's in PHP and I'm using Python 3.
My goal is to extract a substring from a body text.
The body is formatted:
**Header1**
thing1
thing2
thing3
thing4
**Header2**
dsfgs
sdgsg
rrrrrr
**Hello Dolly**
abider
abcder
ffffff
etc.
Formatting on SO is tough. But in the actual text, there's no spaces, just newlines for each line.
I want what's under Header2, so currently I have:
found = re.search("\*\*Header2\*\*\n[^*]+",body)
if found:
    list = found.group(0)
    list = list[11:]
    list = list.split('\n')
    print(list)
But that's returning "None". Various other regex I've tried also haven't worked, or grabbed too much (all of the remaining headers).
For what it's worth I've also tried:
\*\*Header2\*\*.+?^\**$
\*\*Header2\*\*[^*\s\S]+\*\* and about 10 other permutations of those.
Brief
Your pattern \*\*Header2\*\*\n[^*]+ isn't matching because your **Header2** line has trailing spaces before the newline character. Adding " *" (a space followed by *) before the \n should suffice, but I've added other options below as well.
Code
See regex in use here
\*{2}Header2\*{2} *\n([^*]+)
Alternatively, you can use the following regex. It also lets you capture lines that contain *, as long as they don't match the format of your header (^\*{2}[^*]*\*{2}), and it strips the trailing whitespace from the last element under the header. It needs the multiline flag (for ^) and the dot-all flag (so that . can cross newlines):
See regex in use here
^\*{2}Header2\*{2} *\n((?:(?!^\*{2}[^*]*\*{2}).)*?)(?=\s*^\*{2}[^*]*\*{2}|\s*\Z)
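As a rough sketch of how that second pattern could be used from Python: the sample text below is made up, and the choice of re.MULTILINE plus re.DOTALL is an assumption based on the pattern relying on ^ at line starts and on . crossing newlines.

```python
import re

# Pattern copied from the answer; the flags are an assumption (see lead-in).
HEADER_BLOCK = re.compile(
    r"^\*{2}Header2\*{2} *\n((?:(?!^\*{2}[^*]*\*{2}).)*?)(?=\s*^\*{2}[^*]*\*{2}|\s*\Z)",
    re.MULTILINE | re.DOTALL,
)

# Hypothetical sample: trailing spaces everywhere and a "*" inside one line.
sample = (
    "**Header1** \nthing1 \n\n"
    "**Header2** \ndsfgs \n5 * 3 = 15 \nrrrrrr \n\n"
    "**Hello Dolly** \nabider"
)

match = HEADER_BLOCK.search(sample)
# Strip each line, since only the last one is trimmed by the pattern itself.
header2_lines = [line.strip() for line in match.group(1).splitlines()]
print(header2_lines)
```

The line containing a lone * survives here, which the simpler [^*]+ version would cut short.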
Usage
See code in use here
import re
regex = r"\*{2}Header2\*{2}\s*([^*]+)\s*"
test_str = ("**Header1** \n"
"thing1 \n"
"thing2 \n"
"thing3 \n"
"thing4 \n\n"
"**Header2** \n"
"dsfgs \n"
"sdgsg \n"
"rrrrrr \n\n"
"**Hello Dolly** \n"
"abider \n"
"abcder \n"
"ffffff")
print(re.search(regex, test_str).group(1))
Explanation
The pattern is practically identical to the OP's original pattern. I made minor changes to allow it to better perform and also get the result the OP is expecting.
\*\* changed to \*{2}: Very minor adjustment for performance
\n changed to " *\n" (a space and * in front of \n): Takes additional spaces at the end of a line into account before the newline character
([^*]+): Captures the contents the OP is expecting into capture group 1
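To make the diagnosis concrete, here is a small sketch comparing the original pattern with the adjusted one. The body string is a made-up stand-in for the OP's text, assuming a single trailing space after each header.

```python
import re

# Hypothetical body: note the space between "**" and "\n" on the header lines.
body = "**Header1** \nthing1\n**Header2** \ndsfgs\nsdgsg\nrrrrrr\n**Hello Dolly** \nabider\n"

# Original pattern: requires the newline immediately after the header.
strict = re.search(r"\*\*Header2\*\*\n[^*]+", body)

# Adjusted pattern: " *" absorbs any trailing spaces before the newline.
tolerant = re.search(r"\*{2}Header2\*{2} *\n([^*]+)", body)

print(strict)                     # no match because of the trailing space
print(tolerant.group(1).split())  # the three lines under Header2
```

Using the capture group also removes the need for the list[11:] slice in the question.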
You could use
^\*\*Header2\*\*.*[\n\r]
(?P<content>(?:.+[\n\r])+)
with the multiline and verbose modifier, see a demo on regex101.com.
Afterwards, just grab what is inside content (i.e. using re.finditer()).
Broken down this says:
^\*\*Header2\*\*.*[\n\r] # match **Header2** at the start of the line
# and newline characters
(?P<content>(?:.+[\n\r])+) # afterwards match as many non-null lines as possible
In Python:
import re
rx = re.compile(r'''
^\*\*Header2\*\*.*[\n\r]
(?P<content>(?:.+[\n\r])+)
''', re.MULTILINE | re.VERBOSE)
for match in rx.finditer(your_string_here):
    print(match.group('content'))
I have the feeling that you even want to allow empty lines between paragraphs. If so, change the expression to
^\*\*Header2\*\*.*[\n\r]
(?P<content>[\s\S]+?)
(?=^\*\*)
See a demo for the latter on regex101.com as well.
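A possible way to wire the latter expression into Python is sketched below. The helper name section_lines and the extra |\Z alternative in the lookahead are my own additions (the alternative is there so that a Header2 section at the very end of the text, with no following header, still matches); they are not part of the expression above.

```python
import re

# The expression from the answer, plus a hypothetical "|\Z" alternative so the
# section is also found when no further "**" header follows it.
SECTION = re.compile(r'''
    ^\*\*Header2\*\*.*[\n\r]
    (?P<content>[\s\S]+?)
    (?=^\*\*|\Z)
''', re.MULTILINE | re.VERBOSE)

def section_lines(text):
    """Return the non-empty lines under **Header2**, or [] if it is absent."""
    match = SECTION.search(text)
    if match is None:
        return []
    return [line.strip() for line in match.group('content').splitlines() if line.strip()]

middle = "**Header1**\nthing1\n**Header2**\ndsfgs\n\nsdgsg\nrrrrrr\n\n**Hello Dolly**\nabider\n"
print(section_lines(middle))
```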
You can try this:
import re
s = """
**Header1**
thing1
thing2
thing3
thing4
**Header2**
dsfgs
sdgsg
rrrrrr
**Hello Dolly**
abider
abcder
ffffff
"""
new_contents = re.findall(r'(?<=\*\*Header2\*\*)[\n\sa-zA-Z0-9]+', s)
Output:
[' \ndsfgs \nsdgsg \nrrrrrr \n\n']
If you want to strip the whitespace and get the individual entries, you can try this:
final_data = new_contents[0].split()  # splits on any whitespace and drops empty strings
Output:
['dsfgs', 'sdgsg', 'rrrrrr']
Related
I have several examples of raw text from which I have to extract the characters after 'Terms'. The common pattern I see is that the word 'Terms' is followed by a '\n', and the value ends with another '\n'. I want to extract all the characters (words, numbers, symbols) between these two \n, but only after the keyword 'Terms'.
Some examples of text are given below:
1) \nTERMS \nDirect deposit; Routing #256078514, acct. #160935\n\n'
2) \nTerms\nDue on receipt\nDue Date\n1/31/2021
3) \nTERMS: \nNET 30 DAYS\n
The code I have written is given below:
def get_term_regex(s):
    raw_text = s
    term_regex1 = r'(TERMS\s*\\n(.*?)\\n)'
    try:
        if ('TERMS' or 'Terms') in raw_text:
            pattern1 = re.search(term_regex1,raw_text)
            #print(pattern1)
            return pattern1
    except:
        pass
But I am not getting any output, as there is no match.
The expected output is:
1) Direct deposit; Routing #256078514, acct. #160935
2) Due on receipt
3) NET 30 DAYS
Any help would be really appreciated.
Try the following:
import re
text = '''1) \nTERMS \nDirect deposit; Routing #256078514, acct. #160935\n\n'
2) \nTerms\nDue on receipt\nDue Date\n1/31/2021
3) \nTERMS: \nNET 30 DAYS\n''' # \n are real new lines
for m in re.finditer(r'(TERMS|Terms)\W*\n(.*?)\n', text):
    print(m.group(2))
Note that your regex could not deal with the third 'line' because there is a colon : after TERMS. So I replaced \s with \W.
('TERMS' or 'Terms') in raw_text might not be what you want. It does not raise a syntax error, but it is just the same as 'TERMS' in raw_text: when Python evaluates the parenthesized part, both 'TERMS' and 'Terms' are truthy, so or returns the first truthy value, i.e., 'TERMS'. As a result, 'Terms' is never checked by that condition!
So you might instead want something like ('TERMS' in raw_text) or ('Terms' in raw_text), although it is quite verbose.
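A quick sketch of that pitfall, in case it helps. The function names has_terms_buggy and has_terms are made up for illustration, and any() is just one less verbose way to write the corrected check.

```python
# "or" returns its first truthy operand, so the parenthesized part is 'TERMS'.
chosen = 'TERMS' or 'Terms'
print(chosen)

def has_terms_buggy(raw_text):
    # Equivalent to: 'TERMS' in raw_text
    return ('TERMS' or 'Terms') in raw_text

def has_terms(raw_text):
    # Checks every spelling explicitly.
    return any(word in raw_text for word in ('TERMS', 'Terms'))

sample = '\nTerms\nDue on receipt\nDue Date\n1/31/2021'
print(has_terms_buggy(sample), has_terms(sample))
```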
I'm writing a python script to parse VTT subtitle files.
I am using a regular expression to match and extract specific elements:
'in timecode'
'out timecode'
'other info' (mostly alignment information, like align:middle or line:-1)
subtitle content (the actual text)
I am using Python's 're' module from the standard library, and I am looking for a regular expression that will match all (5) of the below 'subtitle events':
WEBVTT
00:00:00.440 --> 00:00:02.320 align:middle line:-1
Hi.
00:00:03.440 --> 00:00:07.520 align:middle line:-1
This subtitle has one line.
00:00:09.240 --> 00:00:11.080 align:middle line:-2
This subtitle has
two lines.
00:00:15.240 --> 00:00:23.960 align:middle line:-4
Now...
Let's try
four...
lines...
00:00:24.080 --> 00:00:27.080 align:middle
PS: Note that stackoverflow doesn't allow me to add an empty line at the end of the code block. Normally the last 'empty' line will exist because a line break (\r\n or \n). After: 00:00:24.080 --> 00:00:27.080 align:middle
Below is my code. My problem is that I can't figure out a regular expression that will match all of the 'subtitle events' (including the one with an empty line as 'subtitle content').
import re
import io
webvttFileObject = io.open(r"C:\Users\john.doe\Documents\subtitle_sample.vtt", 'r', encoding = 'utf-8') # opens WebVTT file forcing UTF-8 encoding
textBuffer = webvttFileObject.read()
regex = re.compile(r"""(^[0-9]{2}[:][0-9]{2}[:][0-9]{2}[.,][0-9]{3}) # match TC-IN in group1
[ ]-->[ ] # VTT/SRT style TC-IN--TC-OUT separator
([0-9]{2}[:][0-9]{2}[:][0-9]{2}[.,][0-9]{3}) # match TC-OUT in group2
(.*)?\n # additional VTT info (like) alignment
(^.+\n)+\n? # subtitle_content """, re.MULTILINE|re.VERBOSE)
subtitle_match_count = 0
for match in regex.finditer(textBuffer):
    subtitle_match_count += 1
    group1, group2, group3, group4 = match.groups()
    tc_in = group1.strip()
    tc_out = group2.strip()
    vtt_extra_info = group3
    subtitle_content = group4
    print "*** subtitle match count: %d ***" % subtitle_match_count
    print "TIMECODE IN".ljust(20), tc_in
    print "TIMECODE OUT".ljust(20), tc_out
    print "ALIGN".ljust(20), vtt_extra_info.strip()
    print "SUBTITLE CONTENT".ljust(20), subtitle_content
    print
I've tried several variations of the regex in the code. All without success. What is also very strange to me is that if I put regex groups in a variable and print them, like I'm doing with this code, I only get the last line as SUBTITLE CONTENT. But I must be doing something wrong (right?). Any help is greatly appreciated.
Thanks in advance.
The reason why your regex doesn't match the last subtitle is here:
(^.+\n)+\n?
The ^.+\n is looking for a line with 1 or more characters. But the last line in the file is empty, so it doesn't match.
The reason why subtitle_content only contains the last line is also there. You're matching each line one by one with (^.+\n)+, i.e. the capture group always captures only a single line. With each matched line, the capture group's previous value is discarded, so in the end all you're left with is the last line. If you want to capture all lines, you have to match them all in one go inside of the capture group, for example like this:
((?:^.+\n)+)
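The difference between the two forms can be seen on a tiny made-up string (Python 3 syntax here, unlike the question's code):

```python
import re

text = "one\ntwo\nthree\n"

# Repeated capture group: each iteration overwrites the previous capture.
repeated = re.match(r"(^.+\n)+", text, re.MULTILINE)

# Non-capturing group repeated inside one capture group: everything is kept.
wrapped = re.match(r"((?:^.+\n)+)", text, re.MULTILINE)

print(repr(repeated.group(1)))
print(repr(wrapped.group(1)))
```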
In order to make the regex work correctly, I've slightly changed the last two lines:
(^[0-9]{2}[:][0-9]{2}[:][0-9]{2}[.,][0-9]{3})
[ ]-->[ ]
([0-9]{2}[:][0-9]{2}[:][0-9]{2}[.,][0-9]{3})
([^\n]*)?\n # replaced `.*` with `[^\n]*` here because of the S-modifier
(.*?)(?:\n\n|\Z) # this now captures everything up to 2 consecutive
# newlines or the end of the string
This regex requires the modifiers m (multiline), s (single-line) and of course x (verbose).
See it in action here.
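For reference, a sketch of the adjusted regex in Python 3 with the three modifiers spelled out. The sample string is a shortened, hypothetical version of the file from the question, and the names CUE and cues are mine.

```python
import re

CUE = re.compile(r"""
    (^[0-9]{2}[:][0-9]{2}[:][0-9]{2}[.,][0-9]{3})  # group 1: in timecode
    [ ]-->[ ]                                      # separator
    ([0-9]{2}[:][0-9]{2}[:][0-9]{2}[.,][0-9]{3})   # group 2: out timecode
    ([^\n]*)?\n                                    # group 3: rest of the timing line
    (.*?)(?:\n\n|\Z)                               # group 4: content up to a blank line or the end
""", re.MULTILINE | re.DOTALL | re.VERBOSE)

sample = (
    "WEBVTT\n\n"
    "00:00:00.440 --> 00:00:02.320 align:middle line:-1\nHi.\n\n"
    "00:00:09.240 --> 00:00:11.080 align:middle line:-2\nThis subtitle has\ntwo lines.\n\n"
    "00:00:24.080 --> 00:00:27.080 align:middle\n\n"
)

cues = [
    (m.group(1), m.group(2), m.group(3).strip(), m.group(4).strip())
    for m in CUE.finditer(sample)
]
for cue in cues:
    print(cue)
```

The last event comes back with an empty string as its content instead of being skipped.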
I have a text file with some names and emails and other stuff. I want to capture email addresses.
I don't know whether this is a split or regex problem.
Here are some sample lines:
[name]bill billy [email]bill.billy#hotmail.com [dob]01.01.81
[name]mark hilly [email]mark.hilly#hotmail.com [dob]02.11.80
[name]gill silly [email]gill.silly#hotmail.com [dob]03.12.79
I want to be able to do a loop that prints all the email addresses.
Thanks.
I'd use a regex:
import re
data = '''[name]bill billy [email]bill.billy#hotmail.com [dob]01.01.81
[name]mark hilly [email]mark.hilly#hotmail.com [dob]02.11.80
[name]gill silly [email]gill.silly#hotmail.com [dob]03.12.79'''
group_matcher = re.compile(r'\[(.*?)\]([^\[]+)')
for line in data.split('\n'):
    o = dict(group_matcher.findall(line))
    print o['email']
\[ is literally [.
(.*?) is a non-greedy capturing group. It "expands" to capture the text.
\] is literally ]
( is the beginning of a capturing group.
[^\[] matches anything but a [.
+ repeats the previous pattern one or more times.
) closes the capturing group.
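To see what the two capturing groups produce, here is a short Python 3 sketch on one sample line. The strip() step is my addition: the second group also picks up the space before the next [field], so values can carry trailing whitespace.

```python
import re

group_matcher = re.compile(r'\[(.*?)\]([^\[]+)')
line = '[name]bill billy [email]bill.billy#hotmail.com [dob]01.01.81'

# findall returns one (field, value) tuple per "[field]value" pair.
pairs = group_matcher.findall(line)
print(pairs)

# Build a dict and trim the trailing space left in front of the next "[".
record = {key: value.strip() for key, value in pairs}
print(record['email'])
```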
for line in lines:
    print line.split("]")[2].split(" ")[0]
You can pass substrings to split, not just single characters, so:
email = line.partition('[email]')[-1].partition('[')[0].rstrip()
This has an advantage over the simple split solutions that it will work on fields that can have spaces in the value, on lines that have things in a different order (even if they have [email] as the last field), etc.
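A small sketch of those two cases, using a made-up helper name (get_email) and a reordered sample line that is not in the question:

```python
def get_email(line):
    # Text after "[email]", cut at the next "[" (if any), without trailing spaces.
    return line.partition('[email]')[-1].partition('[')[0].rstrip()

in_order = '[name]bill billy [email]bill.billy#hotmail.com [dob]01.01.81'
email_last = '[dob]02.11.80 [name]mark hilly [email]mark.hilly#hotmail.com'

print(get_email(in_order))
print(get_email(email_last))
print(repr(get_email('[name]no email here')))  # empty string when the field is missing
```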
To generalize it:
def get_field(line, field):
    return line.partition('[{}]'.format(field))[-1].partition('[')[0].rstrip()
However, I think it's still more complicated than the regex solution. Plus, it can only search for exactly one field at a time, instead of all fields at once (without making it even more complicated). To get two fields, you'll end up parsing each line twice, like this:
for line in data.splitlines():
    print '''{} "babysat" Dan O'Brien on {}'''.format(get_field(line, 'name'),
                                                      get_field(line, 'dob'))
(I may have misinterpreted the DOB field, of course.)
You can split by space and then search for the element that starts with [email]:
line = '[name]bill billy [email]bill.billy#hotmail.com [dob]01.01.81'
items = line.split()
for item in items:
    if item.startswith('[email]'):
        print item.replace('[email]', '', 1)
Say you have a file with these lines.
import re
f = open("logfile", "r")
data = f.read()
for line in data.split("\n"):
    match = re.search(r"email\](?P<id>.*)\[dob", line)
    if match:
        # either store or print the emails as you like
        print match.group('id').strip(), "\n"
That's all. Try it; on Python 3 and above, remember that print is a function and adjust the print statement accordingly.
The output from your sample data:
bill.billy#hotmail.com
mark.hilly#hotmail.com
gill.silly#hotmail.com
I am wanting to verify and then parse this string (in quotes):
string = "start: c12354, c3456, 34526; other stuff that I don't care about"
//Note that some codes begin with 'c'
I would like to verify that the string starts with 'start:' and ends with ';'
Afterward, I would like to have a regex parse out the strings. I tried the following python re code:
regx = r"start: (c?[0-9]+,?)+;"
reg = re.compile(regx)
matched = reg.search(string)
print ' matched.groups()', matched.groups()
I have tried different variations but I can either get the first or the last code but not a list of all three.
Or should I abandon using a regex?
EDIT: updated to reflect part of the problem space I neglected and fixed string difference.
Thanks for all the suggestions - in such a short time.
In Python, this isn’t possible with a single regular expression: each capture of a group overrides the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).
Your easiest solution is to first extract the part between start: and ; and then use a regular expression that returns all matches, not just a single match, with re.findall('c?[0-9]+', text).
You could use the standard string tools, which are pretty much always more readable.
s = "start: c12354, c3456, 34526;"
s.startswith("start:") # returns True if s starts with this string
s.endswith(";") # returns True if s ends with this string
s[7:-1].split(', ') # skips "start: " and the final ";", then gives a list of tokens separated by ", "
This can be done (pretty elegantly) with a tool like Pyparsing:
from pyparsing import Group, Literal, OneOrMore, Optional, ParseException, Word
import string
code = Group(Optional(Literal("c"), default='') + Word(string.digits) + Optional(Literal(","), default=''))
parser = Literal("start:") + OneOrMore(code) + Literal(";")
# Read lines from file:
with open('lines.txt', 'r') as f:
    for line in f:
        try:
            result = parser.parseString(line)
            codes = [c[1] for c in result[1:-1]]
            # Do something with teh codez...
        except ParseException as exc:
            # Oh noes: string doesn't match!
            continue
Cleaner than a regular expression, returns a list of codes (no need to string.split), and ignores any extra characters in the line, just like your example.
import re
sstr = re.compile(r'start:([^;]*);')
slst = re.compile(r'(?:c?)(\d+)')
mystr = "start: c12354, c3456, 34526; other stuff that I don't care about"
match = re.match(sstr, mystr)
if match:
    res = re.findall(slst, match.group(0))
results in
['12354', '3456', '34526']
I am close, but I am not sure what to do with the resulting match object. If I do
p = re.search('[/#.* /]', str)
I'll get any words that start with # and end with a space. This is what I want. However, this returns a Match object that I don't know what to do with. What's the most computationally efficient way of finding and returning a string which is prefixed with a #?
For example,
"Hi there #guy"
After doing the proper calculations, I would be returned
guy
The following regular expression does what you need:
import re
s = "Hi there #guy"
p = re.search(r'#(\w+)', s)
print p.group(1)
It will also work for the following string formats:
s = "Hi there #guy " # notice the trailing space
s = "Hi there #guy," # notice the trailing comma
s = "Hi there #guy and" # notice the next word
s = "Hi there #guy22" # notice the trailing numbers
s = "Hi there #22guy" # notice the leading numbers
That regex does not do what you think it does.
s = "Hi there #guy"
p = re.search(r'#([^ ]+)', s) # this is the regex you described
print p.group(1) # first thing matched inside of ( .. )
But as usual with regex, there are tons of examples that break this; for example, if the text is s = "Hi there #guy, what's with the comma?" the result would be guy,.
So you really need to think about every possible thing you want and don't want to match. r'#([a-zA-Z]+)' might be a good starting point, it literally only matches letters (a .. z, no unicode etc).
p.group(0) returns the whole text matched by the pattern; with a capturing pattern such as r'#(\w+)', p.group(1) returns guy. If you want to find out what an object can do, you can use the built-in dir(p) function. It returns a list of the attributes and methods that are available on that object.
As is evident from the answers so far, regex is the most efficient solution for your problem. The answers differ slightly in what they allow to follow the #:
[^ ] anything but space
\w is equivalent to [A-Za-z0-9_] in Python 2.x (without the LOCALE or UNICODE flags); in Python 3 it matches Unicode word characters by default for str patterns, unless re.ASCII is given
If you have a better idea of which characters might appear in the user name, you can adjust your regex to reflect that; e.g., only lowercase ASCII letters would be:
[a-z]
NB: I skipped quantifiers for simplicity.
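To compare the candidate character sets side by side, here is a Python 3 sketch with quantifiers and a capture group added. The helper first_tag and the sample strings (other than the ones quoted in the answers) are made up for illustration.

```python
import re

def first_tag(pattern, text, flags=0):
    """Return what group 1 captures for the first match, or None."""
    match = re.search(pattern, text, flags)
    return match.group(1) if match else None

comma = "Hi there #guy, what's with the comma?"
digits = "Hi there #guy22"
accent = "Hi there #Jos\u00e9"  # ends with a non-ASCII letter

print(first_tag(r'#([^ ]+)', comma))           # anything but a space
print(first_tag(r'#(\w+)', comma))             # word characters only
print(first_tag(r'#([a-zA-Z]+)', digits))      # ASCII letters only
print(first_tag(r'#(\w+)', accent))            # Unicode-aware by default on str
print(first_tag(r'#(\w+)', accent, re.ASCII))  # restricted to [A-Za-z0-9_]
```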
(?<=#)\w+
will match a word if it's preceded by a # (without adding it to the match, a so-called positive lookbehind). This will match "words" that are composed of letters, numbers, and/or underscore; if you don't want those, use (?<=#)[^\W\d_]+
In Python:
>>> strg = "Hi there #guy!"
>>> p = re.search(r'(?<=#)\w+', strg)
>>> p.group()
'guy'
You say: "If I do p = re.search('[/#.* /]', str) I'll get any words that start with # and end up with a space." But this is incorrect: that pattern is a character class which will match ONE character from the set #, /, ., *, and space. Note: there's a redundant second / in the pattern.
For example:
>>> re.findall('[/#.* /]', 'xxx#foo x/x.x*x xxxx')
['#', ' ', '/', '.', '*', ' ']
>>>
You say that you want "guy" returned from "Hi there #guy" but that conflicts with "and end up with a space".
Please edit your question to include what you really want/need to match.