Extract a string after a text with regex in Python - python

I have a doc file that it has the following structure:
This is a fairy tale written by
John Doe and Mary Smith
Auckland,somewhere
This story is awesome
I would like to extract the two lines of text which are:
John Doe and Mary Smith
Auckland,somewhere
and append those values into a list by using regex. The two lines that I want to extract are always between the lines This is a fairy tale written by and This story is awesome. How can I do that? I have tried some combinations with before_keyword,keyword,after_keyword=text.partition(regex), but no luck at all.

You can use a regex with re.DOTALL that enables . to match any character including newlines. Once you have the text between the two delimiters, you can use another regex without the re.DOTALL to extract lines that contain at least one non-whitespace character (\S).
import re
lst = []
with open('input.txt') as f:
text = f.read()
match = re.search('This is a fairy tale written by(.*?)This story is awesome',
text, re.DOTALL)
if match:
lst.extend(re.findall('.*\S.*', match.group(1)))
print(lst)
Gives:
[' John Doe and Mary Smith', ' Auckland,somewhere']

You may start with this:
re.search(r'(?<=This is a fairy tale written by\n).*?(?=\n\s*This story is awesome)', s, re.MULTILINE|re.DOTALL).group(0)
and fine-tune this regex. re.MULTILINE may be omitted as you do not have ^ or $ anyway, but re.DOTALL is required to let . to match newline as well. The regex above uses look ahead and look behind (?<=), (?=). If you do not like that, you can use parentheses instead for captures.

If you can create a list of strings from your docfile, then no need to use a regex. Just do this simple program:
fileContent = ['This is a fairy tale written by','John Doe and Mary Smith','Auckland,somewhere','This story is awesome',
'Some other things', 'story texts', 'Not Important data',
'This is a fairy tale written by','Kem Cho?','Majama?','This story is awesome', 'Not important data']
authorsList = []
for i in range(len(fileContent)-3):
if fileContent[i] == 'This is a fairy tale written by' and fileContent[i+3] == 'This story is awesome':
authorsList.append([fileContent[i+1], fileContent[i+2]])
print(authorsList)
Here I simply check for 'This is a fairy tale written by' and 'This story is awesome' and if it is found, append text between it in your list.
Output:
[['John Doe and Mary Smith', 'Auckland,somewhere'], ['Kem Cho?', 'Majama?']]

Try using this instead. It should match anything between these two strings.
re.search(r'(?<=This is a fairy tale).*?(?=This story is awesome)',text)

Related

Using regex how can we remove period '.' from prefix's like Mr. and Mrs. but not from the end of the sentences in a big paragraph or more?

Lang: Python. Using regex for instance if I use remove1 = re.sub('\.(?!$)', '', text), it removes all periods. I am only able to remove all periods, not just prefixes. Can anyone help, please? Just put the below text for example.
Mr. and Mrs. Jackson live up the street from us. However, Mrs. Jackson's son lives in the street parallel to us.
You can capture what you want to keep, and match the dot that you want to replace.
\b(Mrs?)\.
Regex demo
In the replacement use group 1 like \1
import re
pattern = r"\b(Mrs?)\."
s = ("Mr. and Mrs. Jackson live up the street from us. However, Mrs. Jackson's son lives in the street parallel to us.\n")
result = re.sub(pattern, r"\1", s)
print(result)
Output
Mr and Mrs Jackson live up the street from us. However, Mrs Jackson's son lives in the street parallel to us.

Iterate and match all elements with regex

So I have something like this:
data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']
I want to replace every element with the first name so it would look like this:
data = ['Alice Smith', 'Tim', 'Uncle Neo']
So far I got:
for i in range(len(data)):
if re.match('(.*) and|with|\&', data[i]):
a = re.match('(.*) and|with|\&', data[i])
data[i] = a.group(1)
But it doesn't seem to work, I think it's because of my pattern but I can't figure out the right way to do this.
Use a list comprehension with re.split:
result = [re.split(r' (?:and|with|&) ', x)[0] for x in data]
The | needs grouping with parentheses in your attempt. Anyway, it's too complex.
I would just use re.sub to remove the separation word & the rest:
data = [re.sub(" (and|with|&) .*","",d) for d in data]
result:
['Alice Smith', 'Tim', 'Uncle Neo']
You can try this:
import re
data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']
final_data = [re.sub('\sand.*?$|\s&.*?$|\swith.*?$', '', i) for i in data]
Output:
['Alice Smith', 'Tim', 'Uncle Neo']
Simplify your approach to the following:
import re
data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']
data = [re.search(r'.*(?= (and|with|&))', i).group() for i in data]
print(data)
The output:
['Alice Smith', 'Tim', 'Uncle Neo']
.*(?= (and|with|&)) - positive lookahead assertion, ensures that name/surname .* is followed by any item from the alternation group (and|with|&)
Brief
I would suggest using Casimir's answer if possible, but, if you are not sure what word might follow (that is to say that and, with, and & are dynamic), then you can use this regex.
Note: This regex will not work for some special cases such as names with apostrophes ' or dashes -, but you can add them to the character list that you're searching for. This answer also depends on the name beginning with an uppercase character and the "union word" as I'll name it (and, with, &, etc.) not beginning with an uppercase character.
Code
See this regex in use here
Regex
^((?:[A-Z][a-z]*\s*)+)\s.*
Substitution
$1
Result
Input
Alice Smith and Bob
Tim with Sam Dunken
Uncle Neo & 31
Output
Alice Smith
Tim
Uncle Neo
Explanation
Assert position at the beginning of the string ^
Match a capital alpha character [A-Z]
Match between any number of lowercase alpha characters [a-z]*
Match between any number of whitespace characters (you can specify spaces if you'd prefer using * instead) \s*
Match the above conditions between one and unlimited times, all captured into capture group 1 (...)+: where ... contains everything above
Match a whitespace character, followed by any character (except new line) any number of times
$1: Replace with capture group 1

strip white spaces and new lines when reading from file

I have the following code, that successfully strips end of line characters when reading from file, but doesn't do so for any leading and trailing white spaces (I want the spaces in between to be left!)
What is the best way to achieve this? (Note, this is a specific example, so not a duplicate of general methods to strip strings)
My code: (try it with the test data: "Mr Moose" (not found) and if you try "Mr Moose " (that is a space after the Moose) it will work.
#A COMMON ERROR is leaving in blank spaces and then finding you cannot work with the data in the way you want!
"""Try the following program with the input: Mr Moose
...it doesn't work..........
but if you try "Mr Moose " (that is a space after Moose..."), it will work!
So how to remove both new lines AND leading and trailing spaces when reading from a file into a list. Note, the middle spaces between words must remain?
"""
alldata=[]
col_num=0
teacher_names=[]
delimiter=":"
with open("teacherbook.txt") as f:
for line in f.readlines():
alldata.append((line.strip()))
print(alldata)
print()
print()
for x in alldata:
teacher_names.append(x.split(delimiter)[col_num])
teacher=input("Enter teacher you are looking for:")
if teacher in teacher_names:
print("found")
else:
print("No")
Desired output, on producing the list alldata
['Mr Moose:Maths', 'Mr Goose:History', 'Mrs Congenelipilling:English']
i.e - remove all leading and trailing white space at the start, and before or after the delimiter. The spaces in between words such as Mr Moose, must be left.
Contents of teacherbook:
Mr Moose : Maths
Mr Goose: History
Mrs Congenelipilling: English
Thanks in advance
You could use a regex:
txt='''\
Mr Moose : Maths
Mr Goose: History
Mrs Congenelipilling: English'''
>>> [re.sub(r'\s*:\s*', ':', line).strip() for line in txt.splitlines()]
['Mr Moose:Maths', 'Mr Goose:History', 'Mrs Congenelipilling:English']
So your code becomes:
import re
col_num=0
teacher_names=[]
delimiter=":"
with open("teacherbook.txt") as f:
alldata=[re.sub(r'\s*{}\s*'.format(delimiter), delimiter, line).rstrip() for line in f]
print(alldata)
for x in alldata:
teacher_names.append(x.split(delimiter)[col_num])
print(teacher_names)
Prints:
['Mr Moose:Maths', 'Mr Goose:History', 'Mrs Congenelipilling:English']
['Mr Moose', 'Mr Goose', 'Mrs Congenelipilling']
The key part is the regex:
re.sub(r'\s*{}\s*'.format(delimiter), delimiter, line).rstrip()
^ 0 to unlimited spaced before the delimiter
^ place for the delimiter
^ unlimited trailing space
Interactive Demo
For an all Python solution, I would use str.partition to get the left hand and right hand side of the delimiter then strip the whitespace as needed:
alldata=[]
with open("teacherbook.txt") as f:
for line in f:
lh,sep,rh=line.rstrip().partition(delimiter)
alldata.append(lh.rstrip() + sep + rh.lstrip())
Same output
Another suggestion. Your data is more suited to a dict than a list.
You can do:
di={}
with open("teacherbook.txt") as f:
for line in f:
lh,sep,rh=line.rstrip().partition(delimiter)
di[lh.rstrip()]=rh.lstrip()
Or comprehension version:
with open("teacherbook.txt") as f:
di={lh.rstrip():rh.lstrip()
for lh,_,rh in (line.rstrip().partition(delimiter) for line in f)}
Then access like this:
>>> di['Mr Moose']
'Maths'
No need to use readlines(), you can simply iterate through the file object to get each line, and use strip() to remove the \n and whitespaces. As such, you can use this list comprehension;
with open('teacherbook.txt') as f:
alldata = [':'.join([value.strip() for value in line.split(':')])
for line in f]
print(alldata)
Outputs;
['Mr Moose:Maths', 'Mr Goose:History', 'Mrs Congenelipilling:English']
Change:
teacher_names.append(x.split(delimiter)[col_num])
to:
teacher_names.append(x.split(delimiter)[col_num].strip())
remove all leading and trailing white space at the start, and before or after the delimiter. The spaces in between words such as Mr Moose, must be left.
You can split your string at the delimiter, strip the whitespace from them, and concatenate them back together again:
for line in f.readlines():
new_line = ':'.join([s.strip() for s in line.split(':')])
alldata.append(new_line)
Example:
>>> lines = [' Mr Moose : Maths', ' Mr Goose : History ']
>>> lines
[' Mr Moose : Maths', ' Mr Goose : History ']
>>> data = []
>>> for line in lines:
new_line = ':'.join([s.strip() for s in line.split(':')])
data.append(new_line)
>>> data
['Mr Moose:Maths', 'Mr Goose:History']
You can do it easily with regex - re.sub:
import re
re.sub(r"[\n \t]+$", "", "aaa \t asd \n ")
Out[17]: 'aaa \t asd'
first argument pattern - [all characters you want to remove]++ - one or more matches$$ - end of the string
https://docs.python.org/2/library/re.html
With string.rstrip('something') you can remove that 'something' from the right end of the string like this:
a = 'Mr Moose \n'
print a.rstrip(' \n') # prints 'Mr Moose\n' instead of 'Mr Moose \n\n'

Regex uppercase words with condition

I'm new to regex and I can't figure it out how to do this:
Hello this is JURASSIC WORLD shut up Ok
[REVIEW] The movie BATMAN is awesome lol
What I need is the title of the movie. It will be only one per sentence. I have to ignore the words between [] as it will not be the title of the movie.
I thought of this:
^\w([A-Z]{2,})+
Any help would be welcome.
Thanks.
You can use negative look arounds to ensure that the title is not within []
\b(?<!\[)[A-Z ]{2,}(?!\])\b
\b Matches word boundary.
(?<!\[) Negative look behind. Checks if the matched string is not preceded by [
[A-Z ]{2,} Matches 2 or more uppercase letters.
(?!\]) Negative look ahead. Ensures that the string is not followed by ]
Example
>>> string = """Hello this is JURASSIC WORLD shut up Ok
... [REVIEW] The movie BATMAN is awesome lol"""
>>> re.findall(r'\b(?<!\[)[A-Z ]{2,}(?!\])\b', string)
[' JURASSIC WORLD ', ' BATMAN ']
>>>

What is efficient way to match words in string?

Example:
names = ['James John', 'Robert David', 'Paul' ... the list has 5K items]
text1 = 'I saw James today'
text2 = 'I saw James John today'
text3 = 'I met Paul'
is_name_in_text(text1,names) # this returns false 'James' in not in list
is_name_in_text(text2,names) # this returns 'James John'
is_name_in_text(text3,names) # this return 'Paul'
is_name_in_text() searches if any of the name list is in text.
The easy way to do is to just check if the name is in the list by using in operator, but the list has 5,000 items, so it is not efficient. I can just split the text into words and check if the words are in the list, but this not going to work if you have more than one word matching. Line number 7 will fail in this case.
Make names into a set and use the in-operator for fast O(1) lookup.
You can use a regex to parse out the possible names in a sentence:
>>> import re
>>> findnames = re.compile(r'([A-Z]\w*(?:\s[A-Z]\w*)?)')
>>> def is_name_in_text(text, names):
for possible_name in set(findnames.findall(text)):
if possible_name in names:
return possible_name
return False
>>> names = set(['James John', 'Robert David', 'Paul'])
>>> is_name_in_text('I saw James today', names)
False
>>> is_name_in_text('I saw James John today', names)
'James John'
>>> is_name_in_text('I met Paul', names)
'Paul'
Build a regular expression with all the alternatives. This way you don't have to worry about somehow pulling the names out of the phrases beforehand.
import re
names_re = re.compile(r'\b' +
r'\b|\b'.join(re.escape(name) for name in names) +
r'\b')
print names_re.search('I saw James today')
You may use Python's set in order to get good performance while using the in operator.
If you have a mechanism of pulling the names out of the phrases and don't need to worry about partial matches (the full name will always be in the string), you can use a set rather than a list.
Your code is exactly the same, with this addition at line 2:
names = set(names)
The in operation will now function much faster.

Categories