Regex: Negative Lookaheads for key-value parsing [duplicate] - python

I am trying to match key-value pairs that appear at the end of (long) strings. The strings look like this (I replaced each "\n" with a real line break):
my_str = "lots of blah
key1: val1-words
key2: val2-words
key3: val3-words"
so I expect matches "key1: val1-words", "key2: val2-words" and "key3: val3-words".
The set of possible key names is known.
Not all possible keys appear in every string.
At least two keys appear in every string (if that makes it easier to match).
val-words can be several words.
key-value pairs should only be matched at the end of string.
I am using Python re module.
I was thinking re.compile('(?:tag1|tag2|tag3):')
plus some look-ahead assertion would be a solution, but I can't get it right. How should I do this?
Thank you.
/David
Real example string:
my_str = u'ucourt métrage pour kino session volume 18\nThème: O sombres héros\nContraintes: sous titrés\nAuthor: nicoalabdou\nTags: wakatanka productions court métrage kino session humour cantat bertrand noir désir sombres héros mer medine marie trintignant femme droit des femmes nicoalabdou pute soumise\nPosted: 06 June 2009\nRating: 1.3\nVotes: 3'
EDIT:
Based on Mikel's solution I am now using the following:
my_tags = ['\S+'] # gets all tags
my_tags = ['Tags','Author','Posted'] # selected tags
regex = re.compile(r'''
\n # all key-value pairs are on separate lines
( # start group to return
(?:{0}): # placeholder for tags to detect '\S+' == all
\s # the space between ':' and value
.* # the value
) # end group to return
'''.format('|'.join(my_tags)), re.VERBOSE)
regex.sub('',my_str) # return my_str without matching key-value lines
regex.findall(my_str) # return matched key-value lines

The negative zero-width lookahead is (?!pattern).
It's mentioned part-way down the re module documentation page.
(?!...)
Matches if ... doesn’t match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it’s not followed by 'Asimov'.
So you could use it to match any number of words after a key, but not a key using something like (?!\S+:)\S+.
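As a small illustration of the two lookaheads mentioned above, here is a minimal sketch. The sample strings and the names isaac, not_a_key, tokens and values are made up for this example, and the input is split on whitespace first so that each word is tested on its own:

```python
import re

# "Isaac " matches only when "Asimov" does not follow it.
isaac = re.compile(r'Isaac (?!Asimov)')
print(bool(isaac.search('Isaac Newton')))   # True
print(bool(isaac.search('Isaac Asimov')))   # False

# A word counts as a value word when it is NOT of the form "something:".
not_a_key = re.compile(r'(?!\S+:)\S+')

# Hypothetical sample input, split into whitespace-separated words.
tokens = 'key1: val1 words key2: more'.split()
values = [t for t in tokens if not_a_key.fullmatch(t)]
print(values)   # ['val1', 'words', 'more']
```

Note that the lookahead is checked at the start of each word here; in a larger pattern it needs something like a preceding \s to anchor it to a word start.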
And the complete code would look like this:
regex = re.compile(r'''
[\S]+: # a key (any word followed by a colon)
(?:
\s # then a space in between
(?!\S+:)\S+ # then a value (any word not followed by a colon)
)+ # match multiple values if present
''', re.VERBOSE)
matches = regex.findall(my_str)
Which gives
['key1: val1-words', 'key2: val2-words', 'key3: val3-words']
If you print the key/values using:
for match in matches:
    print(match)
It will print:
key1: val1-words
key2: val2-words
key3: val3-words
Or using your updated example, it would print:
Thème: O sombres héros
Contraintes: sous titrés
Author: nicoalabdou
Tags: wakatanka productions court métrage kino session humour cantat bertrand noir désir sombres héros mer medine marie trintignant femme droit des femmes nicoalabdou pute soumise
Posted: 06 June 2009
Rating: 1.3
Votes: 3
You could turn each key/value pair into a dictionary using something like this:
pairs = dict([match.split(':', 1) for match in matches])
which would make it easier to look up only the keys (and values) you want.
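A variation on the same idea, as a sketch only: capturing the key and the value in separate groups avoids the split step and makes it easy to strip the whitespace around each value. The name pair_re and the two capture groups are additions for this example, not part of the pattern above:

```python
import re

my_str = "lots of blah\nkey1: val1-words\nkey2: val2-words\nkey3: val3-words"

pair_re = re.compile(r'''
    (\S+):            # group 1: the key, without the colon
    ((?:
        \s            # whitespace before each value word
        (?!\S+:)\S+   # a word that is not itself a key
    )+)               # group 2: all value words
''', re.VERBOSE)

# findall returns (key, value) tuples because the pattern has two groups.
pairs = {key: value.strip() for key, value in pair_re.findall(my_str)}
print(pairs)
# {'key1': 'val1-words', 'key2': 'val2-words', 'key3': 'val3-words'}
```

Looking up a single key is then just pairs.get('key2').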
More info:
Python re module documentation
Python Regular Expression HOWTO
Perl Regular Expression Reference "perlreref"

Related

finding an element between a tag and a list of tags using regex

I want to find elements between two different tags but the catch is the first tag is constant but the second tag can be any tag belonging to a particular list.
for example a string
'TRSF BOOK TRANSFER CREDIT SND= abcd bank , 123 ORG= qwer123 OGB= qwerasd OBI= 123433'
I have a list of tags ['TRSF','SND=','ORG=','OGB=','OBI=']
edit : added the availability of '=' in the list itself
My output should look somewhat like this
TRSF : BOOK TRANSFER CREDIT
SND : abcd bank , 123
ORG : qwer123
OGB : qwerasd
OBI : 123433
The order of the tags, as well as which tags are present, may change, and new tags may appear.
Until now I have been writing a separate regex and string-parsing code for each case, but that seems impractical since the number of combinations is unbounded.
Here is what I was doing :
org = re.findall("ORG=(.*?) OGB=",string_1)
snd = re.findall("SND=(.*?) ORG=",string_1)
_, _, obi = string_1.partition('OBI=')
Is there any way to do it like
<tag>(.*?)<tag in list>
or any other method ?
If the tag list is complete, you can use a regex like
\b(TRSF|SND|ORG|OGB|OBI)\b=?\s*(.*?)(?=\s*\b(?:TRSF|SND|ORG|OGB|OBI)\b|\Z)
See the regex demo. Details:
\b - a word boundary
(TRSF|SND|ORG|OGB|OBI) - a tag captured into Group 1
\b - a word boundary
=? - an optional =
\s* - 0+ whitespaces
(.*?) - Group 2: any zero or more chars, as few as possible
(?=\s*\b(?:TRSF|SND|ORG|OGB|OBI)\b|\Z) - either end of string (\Z) or zero or more whitespaces followed with a tag as a whole word.
See the Python demo:
import re
s='TRSF BOOK TRANSFER CREDIT SND= abcd bank , 123 ORG= qwer123 OGB= qwerasd OBI= 123433'
tags = ['TRSF','SND','ORG','OGB','OBI']
print( dict(re.findall(fr'\b({"|".join(tags)})\b=?\s*(.*?)(?=\s*\b(?:{"|".join(tags)})\b|\Z)', s.strip(), re.DOTALL)) )
# => {'TRSF': 'BOOK TRANSFER CREDIT', 'SND': 'abcd bank , 123', 'ORG': 'qwer123', 'OGB': 'qwerasd', 'OBI': '123433'}
Note that re.DOTALL (the same as re.S) makes . match any character, including line break characters.
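If the tag list changes often, the same regex can be wrapped in a small helper. This is a sketch under the assumption that the tags are plain words as in the question; the function name parse_tagged_fields is made up here, and re.escape is added so that a tag containing a regex metacharacter is taken literally:

```python
import re

def parse_tagged_fields(text, tags):
    """Split text into {tag: value} using a known list of tags."""
    # Escape each tag so characters such as '.' are matched literally.
    alternation = "|".join(re.escape(tag) for tag in tags)
    pattern = re.compile(
        rf"\b({alternation})\b=?\s*(.*?)(?=\s*\b(?:{alternation})\b|\Z)",
        re.DOTALL,
    )
    return dict(pattern.findall(text.strip()))

s = 'TRSF BOOK TRANSFER CREDIT SND= abcd bank , 123 ORG= qwer123 OGB= qwerasd OBI= 123433'
fields = parse_tagged_fields(s, ['TRSF', 'SND', 'ORG', 'OGB', 'OBI'])
print(fields['SND'])      # abcd bank , 123
print(fields.get('XYZ'))  # None, because absent tags are simply missing
```

New tags only need to be appended to the list passed in; the pattern is rebuilt on each call.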

Removing varying text phrases through RegEx in a Python Data frame

Basically, I want to remove the certain phrase patterns embedded in my text data:
Starts with an upper case letter and ends with an Em Dash "—"
Starts with an Em Dash "—" and ends with a "Read Next"
Say, I've got the following data:
CEBU CITY—The widow of slain human rights lawyer .... citing figures from the NUPL that showed that 34 lawyers had been killed in the past two years. —WITH REPORTS FROM JULIE M. AURELIO AND DJ YAPRead Next
and
Manila, Philippines—President .... but justice will eventually push its way through their walls of impunity, ... —REPORTS FROM MELVIN GASCON, JULIE M. AURELIO, DELFIN T. MALLARI JR., JEROME ANING, JOVIC YEE, GABRIEL PABICO LALU, PATHRICIA ANN V. ROXAS, DJ YAP, AFP, APRead Next
I want to remove the following phrases:
"CEBU CITY—"
"—WITH REPORTS FROM JULIE M. AURELIO AND DJ YAPRead Next"
"Manila, Philippines—"
"—REPORTS FROM MELVIN GASCON, JULIE M. AURELIO, DELFIN T. MALLARI JR., JEROME ANING, JOVIC YEE, GABRIEL PABICO LALU, PATHRICIA ANN V. ROXAS, DJ YAP, AFP, APRead Next"
I assume this needs two regexes, one for each pattern listed above.
The regex: —[A-Z].*Read Next\s*$ may work on the pattern # 2 but only when there are no other em dashes in the text data. It will not work when pattern # 1 occurs as it will remove the chunk from the first em dash it has seen until the "Read Next" string.
I have tried the following regex for pattern # 1:
^[A-Z]([A-Za-z]).+(—)$
But it does not work. Why? That regex was supposed to look for a phrase that starts with any upper case letter, followed by a string of any length, as long as it ends with an "—".
The character you are treating as a hyphen (-) is not a hyphen but an em dash, so you need a regex that starts with an em dash instead of a hyphen:
^—[A-Z].*Read Next\s*$
Here is the explanation for this regex,
^ --> Start of input
— --> Matches a literal Em Dash whose Unicode Decimal Code is 8212
[A-Z] --> Matches an upper case letter
.* --> Matches any character zero or more times
Read Next --> Matches these literal words
\s* --> This is for matching any optional white space that might be present at the end of line
$ --> End of input
This regex should take care of it:
^—[A-Z]+(.)*(Read Next)$
You can try implementing this regex on your data and see if it works out.
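For removing both phrases from one text, a possible sketch follows. It is not taken from either answer above: it assumes every text starts with a dateline ending in the first em dash and that the credit line starts at the last em dash. The names EM_DASH, DATELINE, CREDIT and strip_article, and the shortened sample text, are made up for illustration:

```python
import re

EM_DASH = '\u2014'  # the em dash character, decimal code 8212

# Leading dateline: an upper-case letter at the start, then anything that
# is not an em dash, up to and including the first em dash.
DATELINE = re.compile(rf'^[A-Z][^{EM_DASH}]*{EM_DASH}')

# Trailing credit: the last em dash and everything after it, provided the
# text ends with "Read Next" (optionally followed by whitespace).
CREDIT = re.compile(rf'{EM_DASH}[^{EM_DASH}]*Read Next\s*$')

def strip_article(text):
    """Remove the dateline and the trailing credit from one article text."""
    return CREDIT.sub('', DATELINE.sub('', text)).strip()

# Hypothetical, shortened sample in the shape of the question's data.
sample = (f'CEBU CITY{EM_DASH}Body text here. '
          f'{EM_DASH}WITH REPORTS FROM JULIE M. AURELIO AND DJ YAPRead Next')
print(strip_article(sample))   # Body text here.
```

Excluding the em dash with a negated character class keeps each pattern from running across the other one. In a pandas data frame the function could be applied per row, for example with Series.apply.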

Regex to find name in sentence

I have some sentence like
1:
"RLB shows Oubre Jr. (WAS) legally ties up Nurkic (POR), and a held
ball is correctly called."
2:
"Nurkic (POR) maintains legal
guarding position and makes incidental contact with Wall (WAS) that
does not affect his driving shot attempt."
I need to use a Python regex to find the names "Oubre Jr." and "Nurkic" in sentence 1, and "Nurkic" and "Wall" in sentence 2.
Using this pattern:
p = r'\s*(\w+?)\s[(]'
I can find ['Nurkic', 'Wall'] in sentence 2, but in sentence 1 I only find ['Nurkic'] and miss "Oubre Jr.".
How can I match it as well?
You can use the following regex:
(?:[A-Z][a-z][\s\.a-z]*)+(?=\s\()
|-----Main Pattern-----|
Details:
(?:) - Creates a non-capturing group
[A-Z] - Matches 1 uppercase letter
[a-z] - Matches 1 lowercase letter
[\s\.a-z]* - Matches whitespace, periods ('.') or lowercase letters 0+ times
(?=\s\() - Positive lookahead: the main pattern matches only if it is followed by the string ' ('
import re
# 'text' is used instead of 'str' to avoid shadowing the built-in type
text = '''RLB shows Oubre Jr. (WAS) legally ties up Nurkic (POR), and a held ball is correctly called.
Nurkic (POR) maintains legal guarding position and makes incidental contact with Wall (WAS) that does not affect his driving shot attempt.'''
res = re.findall(r'(?:[A-Z][a-z][\s\.a-z]*)+(?=\s\()', text)
print(res)
Demo: https://repl.it/#RahulVerma8/OvalRequiredAdvance?language=python3
Match: https://regex101.com/r/OsLTrY/1
Here is one approach:
import re
line = "RLB shows Oubre Jr (WAS) legally ties up Nurkic (POR), and a held ball is correctly called."
# double quotes around the pattern so that the ' inside the character class does not end the string
results = re.findall(r"([A-Z][\w']+(?: [JS][r][.]?)?)(?= \([A-Z]+\))", line, re.M|re.I)
print(results)
['Oubre Jr', 'Nurkic']
The above logic will attempt to match one name, beginning with a capital letter, which is possibly followed by either the suffix Jr. or Sr., which in turn is followed by a ([A-Z]+) term.
You need a pattern that you can match. For your sentences you could try to match whatever comes before (XXX) and include a list of possible "suffixes" as well; you would need to extract them from your sources.
import re
suffs = ["Jr."] # append more to list
rsu = r"(?:"+"|".join(suffs)+")? ?"
# combine with suffixes
regex = r"(\w+ "+rsu+")\(\w{3}\)"
test_str = "RLB shows Oubre Jr. (WAS) legally ties up Nurkic (POR), and a held ball is correctly called. Nurkic (POR) maintains legal guarding position and makes incidental contact with Wall (WAS) that does not affect his driving shot attempt."
matches = re.finditer(regex, test_str, re.MULTILINE)
names = []
for match in matches:
    # group 1 holds the name plus the optional suffix
    names.append(match.group(1))
print(names)
Output:
['Oubre Jr. ', 'Nurkic ', 'Nurkic ', 'Wall ']
This should work as long as you do not have Names with non-\w in them. If you need to adapt the regex, use https://regex101.com/r/pRr9ZU/1 as starting point.
Explanation:
r"(?:"+"|".join(suffs)+")? ?" --> all items in the list suffs are joined with | (OR) inside a non-capturing group (?:...), which is made optional and followed by an optional space.
r"(\w+ "+rsu+")\(\w{3}\)" --> the regex looks for one or more word characters and a space, followed by the optional suffix group we just built, followed by a literal ( then three word characters and a literal )
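A possible variation on the suffix-list approach, as a sketch only: escaping each suffix makes the '.' in "Jr." literal, and moving the " (XXX)" part into a lookahead keeps the trailing space out of the result. The function name find_names and the extra "Sr." suffix are assumptions for this example:

```python
import re

def find_names(text, suffixes=("Jr.", "Sr.")):
    """Return names that directly precede a three-character code like (WAS)."""
    # Escape the suffixes so that the '.' in "Jr." is a literal dot.
    suffix_group = "|".join(re.escape(s) for s in suffixes)
    # One word, an optional suffix, then a lookahead for " (XXX)".
    pattern = re.compile(rf"\w+(?: (?:{suffix_group}))?(?= \(\w{{3}}\))")
    return pattern.findall(text)

test_str = ("RLB shows Oubre Jr. (WAS) legally ties up Nurkic (POR), and a held "
            "ball is correctly called. Nurkic (POR) maintains legal guarding "
            "position and makes incidental contact with Wall (WAS) that does "
            "not affect his driving shot attempt.")
print(find_names(test_str))
# ['Oubre Jr.', 'Nurkic', 'Nurkic', 'Wall']
```

As with the original, this only handles single-word names plus a listed suffix.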

Python multiline regex search between sections

I am trying to sort data coming from an online plain text government report that looks something like this:
Potato Prices as of 24-SEP-2014
Idaho
BrownSpuds
SomeSpuds 1.90-3.00 mostly 2.00-2.50
MoreSpuds 2.50-3.50
LotofSpuds 5.00-6.50
Washington
RedSpuds
TinyReds 1.50-2.00
BigReds 2.00-3.50
BrownSpuds
SomeSpuds 1.50-2.50
MoreSpuds 3.00-3.50
LotofSpuds 5.50-6.50
BulkSpuds 1.00-2.50
Long Island
SomeSpuds 1.50-2.50 MoreSpuds 2.70-3.75 LotofSpuds 5.00-6.50
etc...
I included the inconsistent indents and line breaks intentionally. This is a government operation.
But I need a function that can look up the price for "MoreSpuds" in Idaho, for example, or "TinyReds" in Washington. I have an inkling that this is a job for Regex, but I can't figure out how to search multiple lines between "Idaho" and "Washington".
EDIT: Adding the following difficulty. A particular item isn't always present in a given state. For example, "RedSpuds" in Washington might go out of season before "RedSpuds" in another state. I need the search to end before it reaches the next state, giving me no price at all, if the item isn't listed.
I also just ran into a case where the prices were written in a paragraph instead of a list. Sort of like the last example, but the actual product names are a lot longer, such as "One baled 10 5-lb sacks sz A 10.00-10.50" so some of the names get split between lines, meaning there might be a newline anywhere in the middle of the name.
Use the DOTALL modifier (?s) so that the dot also matches newline characters.
>>> import re
>>> s = """Potato Prices as of 24-SEP-2014
... Idaho
... BrownSpuds
... SomeSpuds 1.90-3.00 mostly 2.00-2.50
... MoreSpuds 2.50-3.50
... LotofSpuds 5.00-6.50
...
... Washington
...
... RedSpuds
... TinyReds 1.50-2.00
... BigReds 2.00-3.50
... BrownSpuds
... SomeSpuds 1.50-2.50
... MoreSpuds 3.00-3.50
... LotofSpuds 5.50-6.50
... BulkSpuds 1.00-2.50
...
... Long Island
... SomeSpuds 1.50-2.50 MoreSpuds 2.70-3.75 LotofSpuds 5.00-6.50"""
To get the price of MoreSpuds in Idaho,
>>> m = re.search(r'(?s)\bIdaho\n*(?:(?!\n\n).)*?MoreSpuds\s+(\S+)', s)
>>> m.group(1)
'2.50-3.50'
To get the price of TinyReds in Washington,
>>> m = re.search(r'(?s)\bWashington\n*(?:(?!\n\n).)*?TinyReds\s+(\S+)', s)
>>> m.group(1)
'1.50-2.00'
Pattern Explanation:
(?s) - DOTALL modifier.
\b - Word boundary, which matches between a word character and a non-word character.
Washington - State name.
\n* - Matches zero or more newline characters.
(?:(?!\n\n).)*? - The negative lookahead inside the non-capturing group lets each character match only if a blank line (\n\n) does not start at that position, so the match cannot run into the next section. The ? after the * forces the regex engine to make the shortest possible match.
TinyReds - Product name.
\s+ - Matches one or more whitespace characters.
(\S+) - The following one or more non-whitespace characters are captured into group 1.
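The two searches above can be folded into one lookup function. This is a sketch that assumes, as in the sample, that each state's section ends at a blank line; the function name lookup_price and the shortened report text are made up for this example. It returns None when the item is not listed for that state:

```python
import re

def lookup_price(report, state, item):
    """Return the first price after `item` inside `state`'s section, or None."""
    pattern = re.compile(
        # State name, optional newlines, then characters up to the item,
        # never crossing a blank line (\n\n).
        rf'\b{re.escape(state)}\n*(?:(?!\n\n).)*?\b{re.escape(item)}\s+(\S+)',
        re.DOTALL,
    )
    m = pattern.search(report)
    return m.group(1) if m else None

# Shortened, hypothetical copy of the report from the question.
report = (
    "Potato Prices as of 24-SEP-2014\n"
    "Idaho\n"
    "BrownSpuds\n"
    "SomeSpuds 1.90-3.00 mostly 2.00-2.50\n"
    "MoreSpuds 2.50-3.50\n"
    "LotofSpuds 5.00-6.50\n"
    "\n"
    "Washington\n"
    "\n"
    "RedSpuds\n"
    "TinyReds 1.50-2.00\n"
    "BigReds 2.00-3.50\n"
)
print(lookup_price(report, 'Idaho', 'MoreSpuds'))      # 2.50-3.50
print(lookup_price(report, 'Washington', 'TinyReds'))  # 1.50-2.00
print(lookup_price(report, 'Idaho', 'TinyReds'))       # None
```

Product names that are split across lines, as described in the question's edit, are not handled by this sketch.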

Why doesn't this regular expression work in all cases?

I have a text file containing entries like this:
#markwarner VIRGINIA - Mark Warner
#senatorleahy VERMONT - Patrick Leahy NO
#senatorsanders VERMONT - Bernie Sanders
#orrinhatch UTAH - Orrin Hatch NO
#jimdemint SOUTH CAROLINA - Jim DeMint NO
#senmikelee UTAH -- Mike Lee
#kaybaileyhutch TEXAS - Kay Hutchison
#johncornyn TEXAS - John Cornyn
#senalexander TENNESSEE - Lamar Alexander
I have written the following to remove the 'NO' and the dashes using regular expressions:
import re
politicians = open('testfile.txt')
text = politicians.read()
# Grab the 'no' votes
# Should be 11 entries
regex = re.compile(r'(no\s#[\w+\d+\.]*\s\w+\s?\w+?\s?\W+\s\w+\s?\w+)', re.I)
no = regex.findall(text)
## Make the list a string
newlist = ' '.join(no)
## Replace the dashes in the string with a space
deldash = re.compile('\s-*\s')
a = deldash.sub(' ', newlist)
# Delete 'NO' in the string
delno = re.compile('NO\s')
b = delno.sub('', a)
# make the string into a list
# problem with #jimdemint SOUTH CAROLINA Jim DeMint
regex2 = re.compile(r'(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+)', re.I)
lst1 = regex2.findall(b)
for i in lst1:
    print(i)
When I run the code, it captures the twitter handle, state and full names other than the surname of Jim DeMint. I have stated that I want to ignore case for the regex.
Any ideas? Why is the expression not capturing this surname?
The surname is missed because his state name contains two words: SOUTH CAROLINA.
Changing your second regex to this should help:
(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+(?:\s\w+)?)
I added
(?:\s\w+)?
which is an optional, non-capturing group matching a space followed by one or more word characters (letters, digits, or underscores).
http://regexr.com?31fv5 shows that it properly matches the input with the NOs and dashes stripped
EDIT:
If you want one master regex to capture and split everything properly after you have removed the NOs and dashes, use
((#[\w]+?\s)((?:(?:[\w]+?)\s){1,2})((?:[\w]+?\s){2}))
Which you can play with here: http://regexr.com?31fvk
The full match is available in $1, the Twitter handle in $2, the state in $3, and the name in $4.
Each capturing group works as follows:
(#[\w]+?\s)
This matches a # sign followed by one or more word characters, as few as possible, up to a space.
((?:(?:[\w]+?)\s){1,2})
This matches and captures one or two words, which should be the state. It only works because of the next piece, which MUST contain two words.
((?:[\w]+?\s){2})
This matches and captures exactly two words, where a word is as few characters as possible followed by a space.
To strip the 'NO' markers and the dashes:
text = re.sub(r' (NO|-+)(?= |$)', '', text)
And to capture everything:
re.findall(r'(#\w+) ([A-Z ]+[A-Z]) (.+?(?= #|$))', text)
Or all at once:
re.findall(r'(#\w+) ([A-Z ]+[A-Z])(?: NO| -+)? (.+?(?= #|$))', text)
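An alternative sketch, assuming one entry per line as in the sample file: parse each line on its own with named groups, so a multi-word state such as SOUTH CAROLINA and a trailing NO need no separate clean-up pass. The names LINE and parse_entries are made up for this example, and the three sample lines are taken from the question:

```python
import re

LINE = re.compile(r'''
    (?P<handle>\#\w+)\s+              # the handle as it appears in the file
    (?P<state>[A-Z]+(?:\s[A-Z]+)*)    # one or more upper-case words
    \s+-+\s+                          # one or more dashes between state and name
    (?P<name>.+?)                     # the name, as short as possible
    (?:\s+NO)?\s*$                    # optional trailing NO marker
''', re.VERBOSE)

def parse_entries(text):
    """Return (handle, state, name) tuples for every line that matches."""
    entries = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            entries.append((m.group('handle'), m.group('state'), m.group('name')))
    return entries

sample = (
    "#markwarner VIRGINIA - Mark Warner\n"
    "#jimdemint SOUTH CAROLINA - Jim DeMint NO\n"
    "#senmikelee UTAH -- Mike Lee\n"
)
for entry in parse_entries(sample):
    print(entry)
```

Because the pattern is compiled with re.VERBOSE, the # of the handle has to be escaped; otherwise it would start a comment.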
