Python Regex match all occurrences of decimal pattern followed by another pattern - python

I've done lots of searching, including this SO post, which almost worked for me.
I'm working with a huge string, trying to capture the groups of four digits that appear after a series of decimal patterns AND before an alphanumeric word.
There are other four digit number groups that don't qualify since they have words or other number patterns before them.
EDIT: my string is not multiline, it is just shown here for visual convenience.
For example:
>> my_string = """BEAVER COUNTY 001 0000
1010 BEAVER
2010 BEAVER COUNTY SCH DIST
0.008504
...(more decimals)
0.008508
4010 COUNTY SPECIAL SERVICE DIST NO.1 <---capture this 4010
4040 BEAVER COUNTY
8005 GREENVILLE SOLAR
0.004258
0.008348
...(more decimals)
0.008238
4060 SPECIAL SERVICE DISTRICT NO 7 <---capture this 4060
"""
The ideal re.findall should return:
['4010','4060']
Here are patterns I've tried that are lacking:
re.findall(r'(?=(\d\.\d{6}\s+)(\s+\d{4}\s))', my_string)
# also tried
re.findall("(\s+\d{4}\s+)(?:(?!^\d+\.\d+)[\s\S])*", my_string)
# which gets me a little closer but I'm still not getting what I need.
Thanks in advance!

SINGLE LINE STRING APPROACH:
Just match the float number right before the 4 standalone digits:
r'\d+\.\d+\s+(\d{4})\b'
See this regex demo
Python demo:
import re
p = re.compile(r'\d+\.\d+\s+(\d{4})\b')
s = "BEAVER COUNTY 001 0000 1010 BEAVER 2010 BEAVER COUNTY SCH DIST 0.008504 0.008508 4010 COUNTY SPECIAL SERVICE DIST NO.1 4040 BEAVER COUNTY 8005 GREENVILLE SOLAR 0.004258 0.008348 0.008238 4060 SPECIAL SERVICE DISTRICT NO 7"
print(p.findall(s))
# => ['4010', '4060']
ORIGINAL ANSWER: MULTILINE STRING
You may use a regex that checks for a float value on the previous line and then captures the standalone 4 digits on the next line:
re.compile(r'^\d+\.\d+ *[\r\n]+(\d{4})\b', re.M)
See regex demo here
Pattern explanation:
^ - start of a line (as re.M is used)
\d+\.\d+ - 1+ digits, . and again 1 or more digits
* - zero or more spaces (replace with [^\S\r\n] to only match horizontal whitespace)
[\r\n]+ - 1 or more LF or CR symbols (to only restrict to 1 linebreak, replace with (?:\r?\n|\r))
(\d{4})\b - Group 1 returned by the re.findall matching 4 digits followed with a word boundary (a non-digit, non-letter, non-_).
Python demo:
import re
p = re.compile(r'^\d+\.\d+ *[\r\n]+(\d{4})\b', re.MULTILINE)
s = "BEAVER COUNTY 001 0000 \n1010 BEAVER \n2010 BEAVER COUNTY SCH DIST \n0.008504 \n...(more decimals)\n0.008508 \n4010 COUNTY SPECIAL SERVICE DIST NO.1 <---capture this 4010\n4040 BEAVER COUNTY \n8005 GREENVILLE SOLAR\n0.004258 \n0.008348 \n...(more decimals)\n0.008238 \n4060 SPECIAL SERVICE DISTRICT NO 7 <---capture this 4060"
print(p.findall(s)) # => ['4010', '4060']
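The two substitutions mentioned in the pattern explanation (horizontal whitespace only, and exactly one line break) can be sketched as follows. The sample string is a made-up, shortened input in which the second decimal is followed by a blank line; the names loose and strict are only for illustration.

```python
import re

# Pattern from the answer: spaces, then one or more CR/LF characters.
loose = re.compile(r'^\d+\.\d+ *[\r\n]+(\d{4})\b', re.MULTILINE)
# Variant with both substitutions applied:
# horizontal whitespace only, then exactly one line break.
strict = re.compile(r'^\d+\.\d+[^\S\r\n]*(?:\r?\n|\r)(\d{4})\b', re.MULTILINE)

# Hypothetical sample: a blank line sits between 0.008238 and 4060.
sample = "0.008508 \n4010 COUNTY\n0.008238\n\n4060 DISTRICT"

loose_hits = loose.findall(sample)
strict_hits = strict.findall(sample)
print(loose_hits)   # ['4010', '4060']
print(strict_hits)  # ['4010']
```

The stricter variant skips 4060 here because the blank line means the four digits are no longer on the line directly after the decimal.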

This pattern should help:
((\d+\.\d+)\s+)+(\d+)\s?(?=\w+)
Use it with the g and m flags and read the result from group 3 (\3).
Demo and explanation
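A possible Python rendering of this pattern, as a sketch only: the g flag has no Python counterpart, so finditer is used instead, and the sample string is a shortened, made-up version of the question's data. Note that (\d+) here is not limited to four digits.

```python
import re

# Same pattern as above, written as a Python raw string.
pattern = re.compile(r'((\d+\.\d+)\s+)+(\d+)\s?(?=\w+)')

# Shortened, hypothetical version of the question's single-line string.
s = ("SCH DIST 0.008504 0.008508 4010 COUNTY SPECIAL "
     "4040 BEAVER 0.008238 4060 SPECIAL SERVICE")

# re.findall would return a tuple of all three groups per match,
# so iterate with finditer and read group 3 explicitly.
codes = [m.group(3) for m in pattern.finditer(s)]
print(codes)  # ['4010', '4060']
```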

Try this pattern:
re.compile(r'(\d+[.]\d+)+\s+(?P<cap>\d{4})\s+\w+')
I wrote a little code and checked against it and it works.
import re
p=re.compile(r'(\d+[.]\d+)+\s+(?P<cap>\d{4})\s+\w+')
my_string = """BEAVER COUNTY 001 0000
1010 BEAVER
2010 BEAVER COUNTY SCH DIST
0.008504
...(more decimals)
0.008508
4010 COUNTY SPECIAL SERVICE DIST NO.1 <---capture this 4010
4040 BEAVER COUNTY
8005 GREENVILLE SOLAR
0.004258
0.008348
...(more decimals)
0.008238
4060 SPECIAL SERVICE DISTRICT NO 7 <---capture this 4060
"""
s=my_string.replace("\n", " ")
match=p.finditer(s)
for m in match:
    print(m.group('cap'))

Related

Remove Duplicates from csv

I have a csv/txt file of following content:
Mumbai 2
Pune 6
Bangalore 8
Pune 10
Mumbai 8
and I want this in output file :
Mumbai 2,8
Pune 6,10
Bangalore 8
Note : Don't use any python modules, packages
Here is a possible solution:
import re
# raw string so that \s and \S reach the regex engine unchanged
linepat = re.compile(r'''
    ^ \s*
    (?:
        (
            [A-Za-z] \S*
            (?: \s+ [A-Za-z] \S* )*
        ) \s+ ( [0-9]+ )
        \s* $
    )
    |
    (.*)
''', re.VERBOSE)
filtered = {}
# fill `filtered` from `duplicates.csv`
with open('duplicates.csv', 'r') as f:
    for lnum, line in enumerate(f, start=1):
        city, number, invalid = linepat.match(line).groups()
        if not city:
            # blank lines are skipped, anything else is an error
            invalid = invalid.strip()
            if invalid:
                raise Exception(f'line {lnum} has a wrong format:\n{line}')
        else:
            # normalize inner whitespace of the city name
            city = ' '.join(city.split())
            if city not in filtered:
                filtered[city] = set()
            filtered[city].add(int(number))
# write `filtered` to `without_duplicates.csv`
with open('without_duplicates.csv', 'w') as f:
    for city, numbers in filtered.items():
        numbers = ','.join(str(num) for num in sorted(numbers))
        f.write(f'{city} {numbers}\n')
# Mumbai 2
# Pune 6
# New York 15
#
# Bangalore 8
# Pune 10
# Mumbai 8
# New York 1
#
# -->
#
# Mumbai 2,8
# Pune 6,10
# New York 1,15
# Bangalore 8
It is not clear from your example how the numbers on each output line should be ordered. If you want them in order of first occurrence in the input file, use a list instead of a set, do citynumbers = filtered[city]; if number not in citynumbers: citynumbers.append(number), and drop the sorted() call when writing.
The whitespace that separates a city name from its number may also occur inside the city name. Therefore the regex requires every part of the city name to start with [A-Za-z]. A cleaner option is to require that whitespace in city names be replaced or escaped.
filtered in the code example could also be a defaultdict(set).
For many use cases, the csv module is the simpler approach.
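The first-occurrence ordering and the defaultdict variant mentioned in the notes above could look roughly like this. It is a minimal sketch that works on an in-memory list of lines instead of a file and splits on the last run of whitespace instead of using the regex; the names lines, grouped and result are only for illustration.

```python
from collections import defaultdict

# Hypothetical input lines, including one exact duplicate ("Pune 6").
lines = ["Mumbai 2", "Pune 6", "Bangalore 8", "Pune 10", "Mumbai 8", "Pune 6"]

# A list per city keeps the numbers in order of first occurrence.
grouped = defaultdict(list)
for line in lines:
    # Split on the last run of whitespace: city name on the left, number on the right.
    city, number = line.rsplit(None, 1)
    if number not in grouped[city]:
        grouped[city].append(number)

# dicts preserve insertion order (Python 3.7+), so cities also keep input order.
result = [f"{city} {','.join(numbers)}" for city, numbers in grouped.items()]
print(result)  # ['Mumbai 2,8', 'Pune 6,10', 'Bangalore 8']
```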

Split Column on regex

I really struggle with regex, and I'm hoping for some help.
I have columns that look like this
import pandas as pd
data = {'Location': ['Building A, 100 First St City, State', 'Fire Station # 100, 2 Apple Row, City, State Zip', 'Church , 134 Baker Rd City, State']}
df = pd.DataFrame(data)
Location
0 Building A, 100 First St City, State
1 Fire Station # 100, 2 Apple Row, City, State Zip
2 Church , 134 Baker Rd City, State
I would like to get it to the code chunk below by splitting anytime there is a comma followed by space and then a number. However, I'm running into an issue where I'm removing the number.
Location Name Address
0 Building A 100 First St City, State
1 Fire Station # 100 2 Apple Row, City, State, Zip
2 Church 134 Baker Rd City, State
This is the code I've been using
df['Location Name']= df['Location'].str.split('.,\s\d', expand=True)[0]
df['Address']= df['Location'].str.split('.,\s\d', expand=True)[1]
You can use Series.str.extract:
df[['Location Name','Address']] = df['Location'].str.extract(r'^(.*?),\s(\d.*)', expand=True)
The ^(.*?),\s(\d.*) regex matches
^ - start of string
(.*?) - Group 1 ('Location Name'): any zero or more chars other than line break chars as few as possible
,\s - comma and whitespace
(\d.*) - Group 2 ('Address'): digit and the rest of the line.
See the regex demo.
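To see what the two groups capture without pandas, the same pattern can be tried with the plain re module. This is only a sketch on the three sample values from the question; Series.str.extract applies the pattern to each row in a similar way.

```python
import re

pattern = re.compile(r'^(.*?),\s(\d.*)')

locations = [
    'Building A, 100 First St City, State',
    'Fire Station # 100, 2 Apple Row, City, State Zip',
    'Church , 134 Baker Rd City, State',
]

# Each match yields (Group 1, Group 2) = (location name, address).
pairs = [pattern.match(loc).groups() for loc in locations]
for name, address in pairs:
    print(repr(name), '|', repr(address))
```

Note that Group 1 of the third value keeps the space before the comma ('Church '), so a strip may still be needed afterwards.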
Another simple solution to your problem is to use a positive lookahead. You want to check if there is a number ahead of your pattern, while not including the number in the match. Here's an example of a regex that solves your problem:
\s?,\s(?=\d)
Here, we optionally remove a trailing whitespace, then match a comma followed by whitespace.
The (?= ) is a positive lookahead, in this case we check for a following digit. If that's matched, the split will remove the comma and whitespace only.
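A small sketch of the lookahead split with the plain re module, using the sample values from the question. The maxsplit=1 argument is an addition here, assumed to be wanted so that each value is cut into exactly two parts.

```python
import re

# Comma plus whitespace, only when a digit follows; the digit itself is not consumed.
splitter = re.compile(r'\s?,\s(?=\d)')

locations = [
    'Building A, 100 First St City, State',
    'Fire Station # 100, 2 Apple Row, City, State Zip',
    'Church , 134 Baker Rd City, State',
]

parts = [splitter.split(loc, maxsplit=1) for loc in locations]
for name, address in parts:
    print(repr(name), '|', repr(address))
```

Because the optional \s? is part of the separator, the space before the comma in 'Church , 134 ...' is removed as well.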

How to delete a word and everything that follows using regex

County
Davis County
Ark County
Clay County Party
I want to delete County and everything that follows County from the County column. This is what I have tried so far.
def county(df):
    df['County'].replace([r' County [a-z]*'], '', regex = True, inplace = True)
You can use . in place of [a-z] to match all characters or you can use [a-zA-Z ] to match only letters and spaces.
See Mozilla regex reference.
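One thing to watch for: the pattern in the question requires a space after County, so values that end in County are left untouched even when [a-z] is replaced with a dot. The sketch below shows this with the plain re module; the second pattern, \s+County\b.*, is an assumption about what is wanted, not something taken from the answer above.

```python
import re

counties = ["Davis County", "Ark County", "Clay County Party"]

# Question's pattern with . in place of [a-z]: still needs a space after "County".
with_space = [re.sub(r' County .*', '', name) for name in counties]
print(with_space)  # ['Davis County', 'Ark County', 'Clay']

# Assumed variant: drop "County" and anything after it, trailing space or not.
cleaned = [re.sub(r'\s+County\b.*', '', name) for name in counties]
print(cleaned)  # ['Davis', 'Ark', 'Clay']
```

The same pattern string should work with the pandas replace call from the question when regex=True is passed.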

Python regex matching multiline string

my_str :
PCT Filing Date: 2 December 2015
\nApplicants: Silixa Ltd.
\nChevron U.S.A. Inc. (Incorporated
in USA - California)
\nInventors: Farhadiroushan,
Mahmoud
\nGillies, Arran
Parker, Tom'
my code
regex = re.compile(r'(Applicants:)( )?(.*)', re.MULTILINE)
print(regex.findall(text))
my output :
[('Applicants:', ' ', 'Silixa Ltd.')]
what I need is to get the string between 'Applicants:' and '\nInventors:'
'Silixa Ltd.' & 'Chevron U.S.A. Inc. (Incorporated
in USA - California)'
Thanks in advance for your help
Try using re.DOTALL instead:
import re
text='''PCT Filing Date: 2 December 2015
\nApplicants: Silixa Ltd.
\nChevron U.S.A. Inc. (Incorporated
in USA - California)
\nInventors: Farhadiroushan,
Mahmoud
\nGillies, Arran
Parker, Tom'''
regex = re.compile(r'Applicants:(.*?)Inventors:', re.DOTALL)
print(regex.findall(text))
gives me
$ python test.py
[' Silixa Ltd.\n\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n\n']
The reason this works is that MULTILINE doesn't let the dot (.) match newlines, whereas DOTALL will.
If what you want is the contents between Applicants: and \nInventors:, your regex should reflect that:
>>> regex = re.compile(r'Applicants: (.*)Inventors:', re.S)
>>> print(regex.findall(s))
['Silixa Ltd.\n\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n']
re.S is the "dot matches all" option, so our (.*) will also match newlines. Note that this is different from re.MULTILINE: re.MULTILINE only makes ^ and $ match at the start and end of each line, and does not change the fact that . will not match newlines. If . doesn't match newlines, a match like (.*) will still stop at the end of the line, not achieving the multiline effect you want.
Also note that if you are not interested in Applicants: or Inventors: you may not want to put that between (), as in (Inventors:) in your regex, because the match will try to create a matching group for it. That's the reason you got 3 elements in your output instead of just 1.
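The difference between the two flags can be checked on a small example. This is a sketch with a shortened, made-up text; the variable names are only for illustration.

```python
import re

text = "Applicants: A Ltd.\nB Inc.\nInventors: C"

# re.MULTILINE does not change the dot: (.*) stops at the end of the line.
line_only = re.findall(r'Applicants: (.*)', text, re.MULTILINE)
print(line_only)  # ['A Ltd.']

# re.DOTALL (re.S) lets the dot match newlines, so (.*) can span lines.
across_lines = re.findall(r'Applicants: (.*)Inventors:', text, re.DOTALL)
print(across_lines)  # ['A Ltd.\nB Inc.\n']

# What re.MULTILINE does change: ^ matches at the start of every line.
line_starts = re.findall(r'^\w+:', text, re.MULTILINE)
string_start = re.findall(r'^\w+:', text)
print(line_starts)   # ['Applicants:', 'Inventors:']
print(string_start)  # ['Applicants:']
```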
If you want to match all the text between \nApplicants: and \nInventors:, you could also get the match without using re.DOTALL preventing unnecessary backtracking.
Match Applicants: and capture in group 1 the rest of that same line and all lines that follow that do not start with Inventors:
Then match Inventors.
^Applicants: (.*(?:\r?\n(?!Inventors:).*)*)\r?\nInventors:
^ Start of a line, since re.MULTILINE is used in the example code (or use \b if it does not have to be at the start of a line)
Applicants: Match literally
( Capture group 1
.* Match the rest of the line
(?:\r?\n(?!Inventors:).*)* Match all lines that do not start with Inventors:
) Close group
\r?\nInventors: Match a newline and Inventors:
Regex demo | Python demo
Example code
import re
text = ("PCT Filing Date: 2 December 2015\n"
"Applicants: Silixa Ltd.\n"
"Chevron U.S.A. Inc. (Incorporated\n"
"in USA - California)\n"
"Inventors: Farhadiroushan,\n"
"Mahmoud\n"
"Gillies, Arran\n"
"Parker, Tom'")
regex = re.compile(r'^Applicants: (.*(?:\r?\n(?!Inventors:).*)*)\r?\nInventors:', re.MULTILINE)
print(regex.findall(text))
Output
['Silixa Ltd.\nChevron U.S.A. Inc. (Incorporated\nin USA - California)']
Here is a more general approach to parse a string like that into a dict of all the keys and values in it (ie, any string at the start of a line followed by a : is a key and the string following that key is data):
import re
txt="""\
PCT Filing Date: 2 December 2015
Applicants: Silixa Ltd.
Chevron U.S.A. Inc. (Incorporated
in USA - California)
Inventors: Farhadiroushan,
Mahmoud
Gillies, Arran
Parker, Tom'"""
pat=re.compile(r'(^[^\n:]+):[ \t]*([\s\S]*?(?=(?:^[^\n:]*:)|\Z))', flags=re.M)
data={m.group(1):m.group(2) for m in pat.finditer(txt)}
Result:
>>> data
{'PCT Filing Date': '2 December 2015\n', 'Applicants': 'Silixa Ltd.\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n', 'Inventors': "Farhadiroushan,\nMahmoud\nGillies, Arran\nParker, Tom'"}
>>> data['Applicants']
'Silixa Ltd.\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n'
Demo of the regex

How to extract first floating numbers appearing after a word?

I'm trying to build an application for text extraction use case but I was not able to extract exact price from it.
I have a text like this,
string1 = 'Friscos #8603\n8100 E. Orchard Road\nGreenwood Village, Colorado 80111\n2013-11-02\nTable 00\nGuest\n1 Oysters 1/2 Shell #1\n1 Crab Cake\n1 Filet 1602 Bone In\n1 Ribeye 22oz Bone In\n1 Asparagus\n1 Potato Au Gratin\n$17.00\n$19.00\n$66.00\n$53.00\n$12.00\n$11.50\nSub Total\nTax\n$178.50\n$12.94\nTotal\n$191.44\n'
string2 = 'Berghotel\nGrosse Scheidegg\n3818 Grindelwald\nFamilie R. Müller\nRech. Nr. 4572\nBar\n30.07.2007/13:29:17\nTisch 7/01\nNM\n#ರ\n2xLatte Macchiato à 4.50 CHF\n1xGloki\nà 5.00 CHF\n1xSchweinschnitzel à 22.00 CHF\n1xChässpätzli à 18.50 CHF\n#ರ #ರ #1ರ\n5.00\n22.00\n18.50\nTotal:\nCHF\n54.50\nIncl. 7.6% MwSt\n54.50 CHF:\n3.85\nEntspricht in Euro 36.33 EUR\nEs bediente Sie: Ursula\nMwSt Nr. : 430 234\nTel.: 033 853 67 16\nFax.: 033 853 67 19\nE-mail: grossescheidegg#bluewin.ch\n'
I want to extract the price that appears after the word total using regex, but I was only able to extract all floating numbers. Note that sometimes words such as sub total also appear, but I only need the price that follows the word total. Sometimes other words may occur after total as well. So the regex should match the word total and extract the floating number that appears next after it.
Any help is appreciated.
This is what I've tried,
re.findall("\d+\.\d+", string1) # this returns all floating numbers.
You can try
(?<=\\nTotal)\:?\D+([\d\.]+)
Demo
You could try this, should work for the example and the other restrictions you mentioned
import re
result = re.search(r'Total\n\$(\d+\.\d+)', string1)
result.group(1) # 191.44
result = re.search(r'Total\:\n.+\n(\d+\.\d+)', string2)
result.group(1) # 54.50
EDIT: If you want only one expression for both, you could try
result = re.search(r'\nTotal\:?(\n\D+)*\n\$?(\d+\.\d+)', string)
result.group(2)
You could use a negative lookbehind to prevent sub being before total, word boundaries to prevent the words being part of a larger word and a capturing group to capture the price.
(?<!\bsub )\btotal\b\D*(\d+(?:\.\d+))
In parts:
(?<!\bsub ) Negative lookbehind, assert what is on the left is not the word sub and a space
\btotal\b Match total between word boundaries to prevent it being part of a larger word
\D* Match 0+ times any char that is not a digit
( Capture group 1
\d+(?:\.\d+) Match 1+ digits followed by a decimal part (add ? after the group to make the decimal part optional)
) Close group
Regex demo | Python demo
For example
import re
regex = r"(?<!\bsub )\btotal\b\D*(\d+(?:\.\d+))"
string1 = 'Friscos #8603\n8100 E. Orchard Road\nGreenwood Village, Colorado 80111\n2013-11-02\nTable 00\nGuest\n1 Oysters 1/2 Shell #1\n1 Crab Cake\n1 Filet 1602 Bone In\n1 Ribeye 22oz Bone In\n1 Asparagus\n1 Potato Au Gratin\n$17.00\n$19.00\n$66.00\n$53.00\n$12.00\n$11.50\nSub Total\nTax\n$178.50\n$12.94\nTotal\n$191.44\n'
string2 = 'Berghotel\nGrosse Scheidegg\n3818 Grindelwald\nFamilie R. Müller\nRech. Nr. 4572\nBar\n30.07.2007/13:29:17\nTisch 7/01\nNM\n#ರ\n2xLatte Macchiato à 4.50 CHF\n1xGloki\nà 5.00 CHF\n1xSchweinschnitzel à 22.00 CHF\n1xChässpätzli à 18.50 CHF\n#ರ #ರ #1ರ\n5.00\n22.00\n18.50\nTotal:\nCHF\n54.50\nIncl. 7.6% MwSt\n54.50 CHF:\n3.85\nEntspricht in Euro 36.33 EUR\nEs bediente Sie: Ursula\nMwSt Nr. : 430 234\nTel.: 033 853 67 16\nFax.: 033 853 67 19\nE-mail: grossescheidegg#bluewin.ch\n'
print(re.findall(regex, string1, re.IGNORECASE))
print(re.findall(regex, string2, re.IGNORECASE))
Output
['191.44']
['54.50']
If what precedes the price should be a dollar sign or the text CHF, you might use an alternation (?:\$|CHF)\s* matching one of the values, followed by matching 0+ whitespace chars:
(?<!\bsub )\btotal\b\D*(?:\$|CHF)\s*(\d+(?:\.\d+))
Regex demo
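A quick way to try the currency variant in Python, as a sketch only: the two sample strings are shortened, made-up excerpts of the receipts from the question.

```python
import re

regex = r"(?<!\bsub )\btotal\b\D*(?:\$|CHF)\s*(\d+(?:\.\d+))"

# Shortened, hypothetical excerpts of string1 and string2 from the question.
receipt_usd = 'Sub Total\nTax\n$178.50\n$12.94\nTotal\n$191.44\n'
receipt_chf = 'Total:\nCHF\n54.50\nIncl. 7.6% MwSt\n54.50 CHF:\n3.85\n'

usd_total = re.findall(regex, receipt_usd, re.IGNORECASE)
chf_total = re.findall(regex, receipt_chf, re.IGNORECASE)
print(usd_total)  # ['191.44']
print(chf_total)  # ['54.50']
```

The lookbehind rejects the Total in Sub Total, and \D* backtracks just far enough for the currency marker to match directly before the number.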
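Something like this might do the trick:
(?<!sub )total.*?(\d+\.\d+)
Make sure to ignore case (re.IGNORECASE). Since the price is on a later line than total in the examples, also pass re.DOTALL so that . can match the line breaks in between.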
