I am trying to sort data coming from an online plain text government report that looks something like this:
Potato Prices as of 24-SEP-2014
Idaho
BrownSpuds
SomeSpuds 1.90-3.00 mostly 2.00-2.50
MoreSpuds 2.50-3.50
LotofSpuds 5.00-6.50
Washington
RedSpuds
TinyReds 1.50-2.00
BigReds 2.00-3.50
BrownSpuds
SomeSpuds 1.50-2.50
MoreSpuds 3.00-3.50
LotofSpuds 5.50-6.50
BulkSpuds 1.00-2.50
Long Island
SomeSpuds 1.50-2.50 MoreSpuds 2.70-3.75 LotofSpuds 5.00-6.50
etc...
I included the inconsistent indents and line breaks intentionally. This is a government operation.
But I need a function that can look up the price for "MoreSpuds" in Idaho, for example, or "TinyReds" in Washington. I have an inkling that this is a job for Regex, but I can't figure out how to search multiple lines between "Idaho" and "Washington".
EDIT: Adding the following difficulty. A particular item isn't always present in a given state. For example, "RedSpuds" in Washington might go out of season before "RedSpuds" in another state. I need the search to end before it reaches the next state, giving me no price at all, if the item isn't listed.
I also just ran into a case where the prices were written in a paragraph instead of a list, sort of like the last example, but the actual product names are a lot longer, such as "One baled 10 5-lb sacks sz A 10.00-10.50", so some of the names get split across lines, meaning there might be a newline anywhere in the middle of a name.
Use the DOTALL modifier (?s) so that the dot matches newline characters as well.
>>> import re
>>> s = """Potato Prices as of 24-SEP-2014
... Idaho
... BrownSpuds
... SomeSpuds 1.90-3.00 mostly 2.00-2.50
... MoreSpuds 2.50-3.50
... LotofSpuds 5.00-6.50
...
... Washington
...
... RedSpuds
... TinyReds 1.50-2.00
... BigReds 2.00-3.50
... BrownSpuds
... SomeSpuds 1.50-2.50
... MoreSpuds 3.00-3.50
... LotofSpuds 5.50-6.50
... BulkSpuds 1.00-2.50
...
... Long Island
... SomeSpuds 1.50-2.50 MoreSpuds 2.70-3.75 LotofSpuds 5.00-6.50"""
To get the price of MoreSpuds in Idaho,
>>> m = re.search(r'(?s)\bIdaho\n*(?:(?!\n\n).)*?MoreSpuds\s+(\S+)', s)
>>> m.group(1)
'2.50-3.50'
To get the price of TinyReds in Washington,
>>> m = re.search(r'(?s)\bWashington\n*(?:(?!\n\n).)*?TinyReds\s+(\S+)', s)
>>> m.group(1)
'1.50-2.00'
Pattern Explanation:
(?s) DOTALL modifier.
\b Word boundary; matches between a word character and a non-word character.
Washington The state name.
\n* Matches zero or more newline characters.
(?:(?!\n\n).)*? A negative lookahead inside a non-capturing group: match any character, as long as it does not start a \n\n (a blank line). The ? after the * makes the match as short as possible, so it stops at the first occurrence of the product name.
TinyReds The product name.
\s+ Matches one or more whitespace characters.
(\S+) The following run of one or more non-space characters is captured into group 1.
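This also covers the case from the EDIT: because the tempered dot cannot cross a blank line, a product that is missing from a state simply produces no match. Continuing the session above, a minimal sketch of a lookup helper built on the same pattern (the function name get_price is just for illustration):
>>> def get_price(state, item, report):
...     """Return the price range for item under state, or None if not listed."""
...     pattern = r'(?s)\b%s\n*(?:(?!\n\n).)*?%s\s+(\S+)' % (re.escape(state), re.escape(item))
...     m = re.search(pattern, report)
...     return m.group(1) if m else None
...
>>> get_price('Washington', 'TinyReds', s)
'1.50-2.00'
>>> get_price('Idaho', 'TinyReds', s) is None
True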
I have a list of abbreviations that I am trying to find in my text using regex. However, I am struggling to match the long forms by their initial letters; I have only managed it by matching the full words. Here is my text:
text = '''They posted out the United States Navy Seals (USNS) to the area.
Entrance was through an underground facility (UGF) as they has to bypass a no-fly-zone (NFZ).
I found an assault-rifle (AR) in the armoury.'''
My list is as such: [USNS, UGF, NFZ, AR]
I would like to find the corresponding long forms in the text using the first letter of each abbreviation. It would also need to be case-insensitive. My attempt so far has been as such:
re.search(r'\bUnited\W+?States\b\W+?Navy\b\W+?Seals\b', text)
which returns United States Navy Seals. However, when I try to use just the first letters:
re.search(r'\bU\W+?S\b\W+?N\b\W+?S\b', text)
it returns nothing. Furthermore, some of the abbreviations contain more than just the initials of the words in the text, such as UGF for underground facility.
My actual goal is to eventually replace all abbreviations in the text (USNS, UGF, NFZ, AR) with their corresponding long forms (United States Navy Seals, underground facility, no-fly-zone, assault-rifle).
In your last regex [1]
re.search(r'\bU\W+?S\b\W+?N\b\W+?S\b', text)
you get no match because you made several mistakes:
\w+ means one or more word characters, while \W+ means one or more non-word characters; you used \W+ where you needed \w+ to match the rest of each word.
the \b boundary anchor is sometimes in the wrong place (i.e. between the initial letter and the rest of the word)
re.search(r'\bU\w+\sS\w+?\sN\w+?\sS\w+', text)
should match.
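For example, with text holding the sample from the question, the corrected pattern finds the whole phrase (output shown as a comment):
import re

print(re.search(r'\bU\w+\sS\w+?\sN\w+?\sS\w+', text))
# <re.Match object; span=(20, 44), match='United States Navy Seals'>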
And, well,
print(re.search(r'\bu\w+?g\w+\sf\w+', text))
matches underground facility, of course, but in a longer text there will be many more irrelevant matches.
Approach to generalization
Finally I built a little "machine" that dynamically creates regular expressions from the known abbreviations:
import re
text = '''They posted out the United States Navy Seals (USNS) to the area.
Entrance was through an underground facility (UGF) as they has to bypass a no-fly-zone (NFZ).
I found an assault-rifle (AR) in the armoury.'''
abbrs = ['USNS', 'UGF', 'NFZ', 'AR']
for abbr in abbrs:
    pattern = ''.join(map(lambda i: '['+i.upper()+i.lower()+'][a-z]+[ a-z-]', abbr))
    print(pattern)
    print(re.search(pattern, text, flags=re.IGNORECASE))
The output of the above script is:
[Uu][a-z]+[ a-z-][Ss][a-z]+[ a-z-][Nn][a-z]+[ a-z-][Ss][a-z]+[ a-z-]
<re.Match object; span=(20, 45), match='United States Navy Seals '>
[Uu][a-z]+[ a-z-][Gg][a-z]+[ a-z-][Ff][a-z]+[ a-z-]
<re.Match object; span=(89, 110), match='underground facility '>
[Nn][a-z]+[ a-z-][Ff][a-z]+[ a-z-][Zz][a-z]+[ a-z-]
<re.Match object; span=(140, 152), match='no-fly-zone '>
[Aa][a-z]+[ a-z-][Rr][a-z]+[ a-z-]
<re.Match object; span=(170, 184), match='assault-rifle '>
Further generalization
If we assume that in a text each abbreviation is introduced after the first occurrence of the corresponding long form, and we further assume that the way it is written definitely starts with a word boundary and definitely ends with a word boundary (no assumptions about capitalization and the use of hyphens), we can try to extract a glossary automatically like this:
import re
text = '''They posted out the United States Navy Seals (USNS) to the area.
Entrance was through an underground facility (UGF) as they has to bypass a no-fly-zone (NFZ).
I found an assault-rifle (AR) in the armoury.'''
# build a regex for an initial
def init_re(i):
    return f'[{i.upper()+i.lower()}][a-z]+[ -]??'

# build a regex for an abbreviation
def abbr_re(abbr):
    return r'\b' + ''.join([init_re(i) for i in abbr]) + r'\b'

# build an inverse glossary from a text
def inverse_glossary(text):
    abbreviations = set(re.findall(r'\([A-Z]+\)', text))
    igloss = dict()
    for pabbr in abbreviations:
        abbr = pabbr[1:-1]
        pattern = '(' + abbr_re(abbr) + ') ' + r'\(' + abbr + r'\)'
        m = re.search(pattern, text)
        if m:
            longform = m.group(1)
            igloss[longform] = abbr
    return igloss

igloss = inverse_glossary(text)
for long in igloss:
    print('{} -> {}'.format(long, igloss[long]))
The output is
no-fly-zone -> NFZ
United States Navy Seals -> USNS
assault-rifle -> AR
underground facility -> UGF
By using an inverse glossary you can easily replace every long form with its corresponding abbreviation, for example as in the sketch below. It is a bit harder to do this for all but the first occurrence. There is plenty of room for refinement, for example correctly handling line breaks within long forms (and using re.compile).
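A minimal sketch of such a replacement, using the igloss dictionary built above; note that it naively replaces every occurrence, including the first one that introduces the abbreviation:
def abbreviate(text, igloss):
    # replace each long form, longest first, so shorter forms cannot clobber longer ones
    for longform in sorted(igloss, key=len, reverse=True):
        text = re.sub(re.escape(longform), igloss[longform], text)
    return text

print(abbreviate(text, igloss))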
To replace the abbreviations with the long forms, you have to build a normal glossary instead of an inverse one:
# build a glossary from a text
def glossary(text):
    abbreviations = set(re.findall(r'\([A-Z]+\)', text))
    gloss = dict()
    for pabbr in abbreviations:
        abbr = pabbr[1:-1]
        pattern = '(' + abbr_re(abbr) + ') ' + r'\(' + abbr + r'\)'
        m = re.search(pattern, text)
        if m:
            longform = m.group(1)
            gloss[abbr] = longform
    return gloss

gloss = glossary(text)
for abbr in gloss:
    print('{}: {}'.format(abbr, gloss[abbr]))
The output here is
AR: assault-rifle
NFZ: no-fly-zone
UGF: underground facility
USNS: United States Navy Seals
The replacement itself is left to the reader.
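For completeness, one possible sketch of that replacement, using the gloss dictionary from above (the \b anchors keep short abbreviations from matching inside other words):
def expand(text, gloss):
    # replace each abbreviation with its long form, matching whole words only
    for abbr, longform in gloss.items():
        text = re.sub(r'\b' + re.escape(abbr) + r'\b', longform, text)
    return text

print(expand(text, gloss))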
[1]
Let's take a closer look at your first regex again:
re.search(r'\bUnited\W+?States\b\W+?Navy\b\W+?Seals\b', text)
The boundary anchors (\b) are redundant. They can be removed without changing the result, because \W+? already requires at least one non-word character after the last character of States and Navy. They cause no problems here, but I suspect they led to the confusion when you modified this regex to get a more general one.
You could use the regex below, which takes care of the case sensitivity as well.
This would just find United States Navy Seals:
\s[uU].*?[sS].*?[nN].*?[sS]\w+
Similarly, for UGF,
you can use \s[uU].*?[gG].*?[fF]\w+
The pattern is built as shown above: the letters are simply joined with .*?, and each letter is written as a character class like [aA] so that it matches either lower or upper case. The pattern starts with \s, since the long form begins at a new word, and it ends with \w+.
Play around.
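A small sketch of how that construction could be automated (the helper name loose_pattern is made up for illustration, and text is the sample string from the question):
import re

def loose_pattern(abbr):
    # join each initial as a [xX]-style class, with lazy gaps in between
    return r'\s' + r'.*?'.join('[%s%s]' % (c.lower(), c.upper()) for c in abbr) + r'\w+'

print(loose_pattern('UGF'))                   # \s[uU].*?[gG].*?[fF]\w+
print(re.search(loose_pattern('UGF'), text))  # matches ' underground facility' in the sample text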
I have a question about regex with backreferences.
I need to match strings. I tried the regex (\w)\1{1,} to capture repeated characters in my string, but it only captures consecutive repeated characters; I'm stuck on improving it to capture all repeated characters. Here are some examples:
import re
str = 'capitals'
re.search(r'(\w)\1{1,}', str)
Output None
import re
str = 'butterfly'
re.search(r'(\w)\1{1,}', str)
<_sre.SRE_Match object; span=(2, 4), match='tt'>
I would use r'(\w).*\1' so that it allows any repeated character even if there are special characters or spaces in between.
However, this won't work for strings where the repeated characters overlap an earlier match, like the string abcdabcd: it only finds the first repeated character (a), ignoring the other repeated characters enclosed in that first match (b, c, d).
Check the demo: https://regex101.com/r/m5UfAe/1
So an alternative (depending on your needs) is to sort the string before analyzing it:
import re
str = 'abcdabcde'
re.findall(r'(\w).*\1', ''.join(sorted(str)))
returning the list of repeated characters: ['a', 'b', 'c', 'd']
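If you need to keep the original order, another option is a lookahead with a backreference, which finds every character that occurs again later without consuming the text in between; a minimal sketch:
import re

s = 'abcdabcde'
# (\w)(?=.*\1): capture a character only if the same character appears again later
print(re.findall(r'(\w)(?=.*\1)', s))   # ['a', 'b', 'c', 'd']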
I hope the code below will help you understand the backreference concept in Python regex.
There are two sets of information available in the given string str:
Employee Basic Info:
starting with #employeename and ending with employeename
e.g. #daniel dxc chennai 45000 male daniel
Employee designation:
starting with %employeename, then the designation, and ending with employeename%
e.g. %daniel python developer daniel%
import re
#sample input
str="""
#daniel dxc chennai 45000 male daniel #henry infosys bengaluru 29000 male hobby-
swimming henry
#raja zoho chennai 37000 male raja #ramu infosys bengaluru 99000 male hobby-badminton
ramu
%daniel python developer daniel% %henry database admin henry%
%raja Testing lead raja% %ramu Manager ramu%
"""
#backreferencing employee name (\w+) <---- \1
#----------------------------------------------
basic_info=re.findall(r'#+(\w+)(.*?)\1',str)
print(basic_info)
#(%) <-- \1 and (\w+) <--- \2
#-------------------------------
designation=re.findall(r'(%)+(\w+)(.*?)\2\1',str)
print(designation)
for i in range(len(designation)):
    designation[i] = (designation[i][1], designation[i][2])

print(designation)
For this regular expression:
(?<!Mr|Dr|Ms|Jr|Sr)[.?!]+(\s)[A-Z0-9]
I want the input string to be split on the captured \s character.
However, when I run this:
import re
p = re.compile(ur'(?<!Mr|Dr|Ms|Jr|Sr)[.?!]+(\s)[A-Z0-9]')
test_str = u"Mr. Smith bought cheapsite.com for 1.5 million dollars i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.23 is the ish. My name is! Why wouldn't you... this is.\nAndrew"
re.split(p, test_str)
It seems to split the string at the regions given by [.?!]+ and [A-Z0-9] (thus incorrectly omitting them) and leaves \s in the results.
To clarify:
Input: he paid a lot for it. Did he mind
Received Output: ['he paid a lot for it', ' ', 'id he mind']
Expected Output: ['he paid a lot for it.','Did he mind']
You need to remove the capturing group from around (\s) and put the last character class into a look-ahead to exclude it from the match:
p = re.compile(ur'(?<!Mr|Dr|Ms|Jr|Sr)[.?!]+\s(?=[A-Z0-9])')
# note: the group around \s is gone and [A-Z0-9] now sits inside a lookahead
test_str = u"Mr. Smith bought cheapsite.com for 1.5 million dollars i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.23 is the ish. My name is! Why wouldn't you... this is.\nAndrew"
print(p.split(test_str))
Any capturing group in a regex pattern will create an additional element in the resulting list during re.split.
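A tiny illustration of that behaviour:
import re

print(re.split(r'\s', 'a b'))    # ['a', 'b']
print(re.split(r'(\s)', 'a b'))  # ['a', ' ', 'b']  the captured separator is kept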
To force the punctuation to appear inside the "sentences", you can use this matching regex with re.findall:
import re
p = re.compile(r'\s*((?:(?:Mr|Dr|Ms|Jr|Sr)\.|\.(?!\s+[A-Z0-9])|[^.!?])*[.?!]|[^.!?]+)')
test_str = "Mr. Smith bought cheapsite.com for 1.5 million dollars i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.23 is the ish. My name is! Why wouldn't you... this is.\nAndrew"
print(p.findall(test_str))
Results:
['Mr. Smith bought cheapsite.com for 1.5 million dollars i.e. he paid a lot for it.', 'Did he mind?', "Adam Jones Jr. thinks he didn't.", "In any case, this isn't true...", "Well, with a probability of .9 it isn't.23 is the ish.", 'My name is!', "Why wouldn't you... this is.", 'Andrew']
The regex follows the rules in your original pattern:
\s* - matches 0 or more whitespace to omit from the result
((?:(?:Mr|Dr|Ms|Jr|Sr)\.|\.(?!\s+[A-Z0-9])|[^.!?])*[.?!]|[^.!?]+) - a group with 2 alternatives that are captured and returned by re.findall:
(?:(?:Mr|Dr|Ms|Jr|Sr)\.|\.(?!\s+[A-Z0-9])|[^.!?])* - 0 or more sequences of...
(?:Mr|Dr|Ms|Jr|Sr)\. - abbreviated titles
\.(?!\s+[A-Z0-9]) - matches a dot not followed by 1 or more whitespace and then uppercase letters or digits
[^.!?] - any character but a ., !, and ?
or...
[^.!?]+ - any one or more characters but a ., !, and ?
I am trying to match key-value pairs that appear at the end of (long) strings. The strings look like this (with the "\n" written out as real line breaks):
my_str = "lots of blah
key1: val1-words
key2: val2-words
key3: val3-words"
so I expect matches "key1: val1-words", "key2: val2-words" and "key3: val3-words".
The set of possible key names is known.
Not all possible keys appear in every string.
At least two keys appear in every string (if that makes it easier to match).
val-words can be several words.
key-value pairs should only be matched at the end of string.
I am using Python re module.
I was thinking re.compile('(?:tag1|tag2|tag3):')
plus some lookahead assertion stuff would be a solution, but I can't get it right. How do I do it?
Thank you.
/David
Real example string:
my_str = u'ucourt métrage pour kino session volume 18\nThème: O sombres héros\nContraintes: sous titrés\nAuthor: nicoalabdou\nTags: wakatanka productions court métrage kino session humour cantat bertrand noir désir sombres héros mer medine marie trintignant femme droit des femmes nicoalabdou pute soumise\nPosted: 06 June 2009\nRating: 1.3\nVotes: 3'
EDIT:
Based on Mikel's solution I am now using the following:
my_tags = [r'\S+'] # gets all tags
my_tags = ['Tags', 'Author', 'Posted'] # selected tags
regex = re.compile(r'''
    \n          # all key-value pairs are on separate lines
    (           # start group to return
      (?:{0}):  # placeholder for tags to detect; '\S+' == all
      \s        # the space between ':' and value
      .*        # the value
    )           # end group to return
    '''.format('|'.join(my_tags)), re.VERBOSE)
regex.sub('', my_str) # return my_str without the matching key-value lines
regex.findall(my_str) # return matched key-value lines
The negative zero-width lookahead is (?!pattern).
It's mentioned part-way down the re module documentation page.
(?!...)
Matches if ... doesn’t match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it’s not followed by 'Asimov'.
So you could use it to match any number of words after a key, as long as those words are not themselves keys, using something like (?!\S+:)\S+.
And the complete code would look like this:
regex = re.compile(r'''
    [\S]+:         # a key (any word followed by a colon)
    (?:
      \s           # then a space in between
      (?!\S+:)\S+  # then a value (any word not followed by a colon)
    )+             # match multiple values if present
    ''', re.VERBOSE)
matches = regex.findall(my_str)
Which gives
['key1: val1-words ', 'key2: val2-words ', 'key3: val3-words']
If you print the key/values using:
for match in matches:
    print(match)
It will print:
key1: val1-words
key2: val2-words
key3: val3-words
Or using your updated example, it would print:
Thème: O sombres héros
Contraintes: sous titrés
Author: nicoalabdou
Tags: wakatanka productions court métrage kino session humour cantat bertrand noir désir sombres héros mer medine marie trintignant femme droit des femmes nicoalabdou pute soumise
Posted: 06 June 2009
Rating: 1.3
Votes: 3
You could turn each key/value pair into a dictionary using something like this:
pairs = dict([match.split(':', 1) for match in matches])
which would make it easier to look up only the keys (and values) you want.
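For example, with the matches produced from the updated example string, a small variant that also strips the leading space from each value:
pairs = {key: value.strip() for key, value in (match.split(':', 1) for match in matches)}
print(pairs.get('Author'))   # nicoalabdou
print(pairs.get('Rating'))   # 1.3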
More info:
Python re module documentation
Python Regular Expression HOWTO
Perl Regular Expression Reference "perlreref"
I have a text file containing entries like this:
#markwarner VIRGINIA - Mark Warner
#senatorleahy VERMONT - Patrick Leahy NO
#senatorsanders VERMONT - Bernie Sanders
#orrinhatch UTAH - Orrin Hatch NO
#jimdemint SOUTH CAROLINA - Jim DeMint NO
#senmikelee UTAH -- Mike Lee
#kaybaileyhutch TEXAS - Kay Hutchison
#johncornyn TEXAS - John Cornyn
#senalexander TENNESSEE - Lamar Alexander
I have written the following to remove the 'NO' and the dashes using regular expressions:
import re
politicians = open('testfile.txt')
text = politicians.read()
# Grab the 'no' votes
# Should be 11 entries
regex = re.compile(r'(no\s#[\w+\d+\.]*\s\w+\s?\w+?\s?\W+\s\w+\s?\w+)', re.I)
no = regex.findall(text)
## Make the list a string
newlist = ' '.join(no)
## Replace the dashes in the string with a space
deldash = re.compile('\s-*\s')
a = deldash.sub(' ', newlist)
# Delete 'NO' in the string
delno = re.compile('NO\s')
b = delno.sub('', a)
# make the string into a list
# problem with #jimdemint SOUTH CAROLINA Jim DeMint
regex2 = re.compile(r'(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+)', re.I)
lst1 = regex2.findall(b)
for i in lst1:
    print(i)
When I run the code, it captures the twitter handle, state and full name for every entry except the surname of Jim DeMint. I have stated that I want to ignore case for the regex.
Any ideas? Why is the expression not capturing this surname?
It's missing it because his state name contains two words: SOUTH CAROLINA
Change your second regex to this; it should help:
(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+(?:\s\w+)?)
I added
(?:\s\w+)?
which is an optional, non-capturing group matching a space followed by one or more word characters (letters, digits or underscores).
http://regexr.com?31fv5 shows that it properly matches the input with the NOs and dashes stripped
EDIT:
If you want one master regex to capture and split everything properly after you remove the NOs and dashes, use
((#[\w]+?\s)((?:(?:[\w]+?)\s){1,2})((?:[\w]+?\s){2}))
Which you can play with here: http://regexr.com?31fvk
The full match is available in $1, the Twitter handle in $2, the state in $3, and the name in $4 (in Python, groups 1 through 4 of each match).
Each capturing group works as follows:
(#[\w]+?\s)
This matches a # sign followed by at least one word character, matching as few as possible, up to a space.
((?:(?:[\w]+?)\s){1,2})
This matches and captures one or two words, which should be the state. This only works because of the next piece, which MUST match exactly two words.
((?:[\w]+?\s){2})
Matches and captures exactly two words, where each word is as few word characters as possible followed by a space.
text = re.sub(r' (NO|-+)(?= |$)', '', text, flags=re.M)  # re.M so $ also matches at the end of each line
And to capture everything:
re.findall(r'(#\w+) ([A-Z ]+[A-Z]) (.+?(?= #|$))', text, flags=re.M)
Or all at once:
re.findall(r'(#\w+) ([A-Z ]+[A-Z])(?: NO| -+)? (.+?(?= #|$))', text, flags=re.M)
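Putting both steps together as a small, runnable sketch (the file name testfile.txt is taken from the question; exact results depend on the data):
import re

with open('testfile.txt') as f:
    text = f.read()

# strip the trailing NOs and the dash separators
cleaned = re.sub(r' (NO|-+)(?= |$)', '', text, flags=re.M)

# pull out (handle, state, name) triples, one entry per line
for handle, state, name in re.findall(r'(#\w+) ([A-Z ]+[A-Z]) (.+?(?= #|$))', cleaned, flags=re.M):
    print(handle, state, name)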