I have a huge PDF of invoices, where every page is very basic text. I need to create a regex or two so that, when I split the PDF, I can get the customer number and the invoice number to use in the file name. I am currently using Python 3 and PyPDF2.
Text example of 2 of the pages:
Detailed Invoice Report
Starting 8/12/2015 and ending 8/11/2022
Company: (Multiple Companies) Printed by Robert S on 8/11/2022 1:26:46PM
Donna Contact Cust# Name: Customer A 1234
Customer A Invoice Date Invoice Name 8/12/2015 241849
Item Description Qty Price Extended Price
Credit ($810.00) 1 ($810.00) 1
Due Paid Total Total Taxes Subtotal
($810.00) ($810.00) $0.00 ($810.00)
Balance: ($810.00) $0.00 $0.00
8/11/2022 1:26:46PM Page 1 of 340977
Detailed Invoice Report
Starting 8/12/2015 and ending 8/11/2022
Company: (Multiple Companies) Printed by Robert S on 8/11/2022 1:26:46PM
Customer B Cust# Name: Customer B 45678
Customer B Invoice Date Invoice Name 8/12/2015 241850
Item Description Qty Price Extended Price
credit ($49.99) 1 ($49.99) 1
Due Paid Total Total Taxes Subtotal
($49.99) ($49.99) $0.00 ($49.99)
Balance: ($49.99) $0.00 $0.00
8/11/2022 1:26:46PM Page 2 of 340977
Currently I have these 2 regexes, which each roughly work, but I do not know how to keep only the last group's match from them.
Note: the firstmatch regex breaks if the customer name has a number in it, which is an edge case but not uncommon in the data.
firstmatch=r"(Name:)(\D*)(\d+)"
secondmatch=r"(Name )(\d*.\d*.\d*..)(\d*)"
Each one is its own page, and I would like the regex to pull 1234 and 241849 from the first one and 45678 and 241850 from the second one.
You could get both values using a capture group that matches the last digits on the line.
For the first pattern:
\bName:.*?\b(\d+)[^\d\n]*$
Explanation
\bName: Match Name: preceded by a word boundary
.*? Match any character except a newline, as few as possible
\b(\d+) A word boundary, then capture 1+ digits in group 1
[^\d\n]* Optionally match any characters except digits or a newline
$ End of the line (in Python this needs the re.MULTILINE flag; without it, $ only matches at the end of the string)
For the second pattern you can make it a bit more specific, where [^\S\n]+ matches 1+ whitespace chars without newlines:
\bName[^\S\n]+\d+/\d+/\d+[^\S\n]+(\d+)[^\d\n]*$
Or, if the two lines directly follow each other, you can also use 1 pattern with 2 capture groups and match the newline at the end of the first line:
\bName:.*?\b(\d+)[^\d\n]*\n\b.*?Name[^\S\n]+\d+/\d+/\d+[^\S\n]+(\d+)[^\d\n]*$
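As a rough sketch, the two single-line patterns could be applied to the text of one page like this. The `$` anchor only matches at the end of each line when `re.MULTILINE` is passed, so the flag is added here. The function name `invoice_filename`, the `customer_invoice.pdf` naming scheme, and the `PAGE_1` constant are assumptions made for illustration; `page_text` stands for whatever text was extracted from a single page.

```python
import re

# Last run of digits on the "Cust# Name:" line (customer number).
CUSTOMER_RE = re.compile(r"\bName:.*?\b(\d+)[^\d\n]*$", re.MULTILINE)
# Digits that follow the date on the "Invoice Name" line (invoice number).
INVOICE_RE = re.compile(
    r"\bName[^\S\n]+\d+/\d+/\d+[^\S\n]+(\d+)[^\d\n]*$", re.MULTILINE
)

# Sample page text copied from the question (hypothetical constant name).
PAGE_1 = """Detailed Invoice Report
Starting 8/12/2015 and ending 8/11/2022
Company: (Multiple Companies) Printed by Robert S on 8/11/2022 1:26:46PM
Donna Contact Cust# Name: Customer A 1234
Customer A Invoice Date Invoice Name 8/12/2015 241849
Item Description Qty Price Extended Price
"""


def invoice_filename(page_text):
    """Return 'customer_invoice.pdf' for one page, or None if either number is missing."""
    customer = CUSTOMER_RE.search(page_text)
    invoice = INVOICE_RE.search(page_text)
    if customer is None or invoice is None:
        return None
    return f"{customer.group(1)}_{invoice.group(1)}.pdf"


print(invoice_filename(PAGE_1))
```

Because the customer pattern anchors on the last digits of the line, a customer name that itself contains a number should not break it.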
I have a data frame as shown below
df = pd.DataFrame({'person_id': [11,11,11,11,11,11,11,11,11,11],
'text':['inJECTable 1234 Eprex DOSE 4000 units on NONd',
'department 6789 DOSE 8000 units on DIALYSIS days - IV Interm',
'inJECTable 4321 Eprex DOSE - 3 times/wk on NONdialysis day',
'insulin MixTARD 30/70 - inJECTable 46 units',
'insulin ISOPHANE -- InsulaTARD Vial - inJECTable 56 units SC SubCutaneous',
'1-alfacalcidol DOSE 1 mcg - 3 times a week - IV Intermittent',
'jevity liquid - FEEDS PO Jevity - 237 mL - 1 times per day',
'1-alfacalcidol DOSE 1 mcg - 3 times per week - IV Intermittent',
'1-supported DOSE 1 mcg - 1 time/day - IV Intermittent',
'1-testpackage DOSE 1 mcg - 1 time a day - IV Intermittent']})
I would like to remove the words/strings which follow patterns such as 46 units, 3 times a week, 3 times per week, 1 time/day etc.
I was reading about positive and negative look ahead and behind.
So, was trying something like below
[^([0-9\s]*(?=units))] #to remove terms like `46 units` from the string
[^[0-9\s]*(?=times)(times a day)] # don't know how to make this work for all time variants
time variants ex: 3 times a day, 3 time/wk, 3 times per day, 3 times a month, 3 times/month etc.
Basically, I expect the output to have terms like these removed: xx units, xx time a day, xx times per week, xx time/day, xx time/wk, xx time/week, xx times per week, etc.
You can consider a pattern like
\s*\d+\s*(?:units?|times?(?:\s+(?:a|per)\s+|\s*/\s*)(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?))
NOTE: the \d+ matches one or more digits. If you need to match any number, please consider using other patterns for a number in the format you expect, see regular expression for finding decimal/float numbers?, for example.
Pattern details
\s* - zero or more whitespace chars
\d+ - one or more digits
\s* - zero or more whitespaces
(?:units?|times?(?:\s+(?:a|per)\s+|\s*/\s*)(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?)) - a non-capturing group matching:
units? - unit or units
| - or
times? - time or times
(?:\s+(?:a|per)\s+|\s*/\s*) - a or per enclosed with 1+ whitespaces, or / enclosed with 0+ whitespaces
(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?) - d or day, or wk or week, or month, or y/yr/yea/year
If you need to match whole words only, use word boundaries, \b:
\s*\b\d+\s*(?:units?|times?(?:\s+(?:a|per)\s+|\s*/\s*)(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?))\b
In Pandas, use
df['text'] = df['text'].str.replace(r'\s*\b\d+\s*(?:units?|times?(?:\s+(?:a|per)\s+|\s*/\s*)(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?))\b', '', regex=True)
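To check what the word-boundary version of the pattern removes without needing pandas, here is a small sketch that applies it with the standard `re` module to a few of the sample strings. The helper name `strip_dose_terms` is made up for this example.

```python
import re

# Same pattern as above, split over two lines for readability.
DOSE_FREQ_RE = re.compile(
    r"\s*\b\d+\s*(?:units?|times?(?:\s+(?:a|per)\s+|\s*/\s*)"
    r"(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?))\b"
)


def strip_dose_terms(text):
    """Remove 'NN units' and 'NN time(s) a/per//<period>' fragments from text."""
    return DOSE_FREQ_RE.sub("", text)


samples = [
    "insulin MixTARD 30/70 - inJECTable 46 units",
    "1-supported DOSE 1 mcg - 1 time/day - IV Intermittent",
    "inJECTable 4321 Eprex DOSE - 3 times/wk on NONdialysis day",
]
for s in samples:
    print(repr(strip_dose_terms(s)))
```

Note that the leading `\s*` also consumes the whitespace before the number, while separators such as ` - ` around the removed term are left in place.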
I'm working on a project which requires extracting all the case numbers from a given string. Can anyone please help me create a regex that matches the pattern for all the case numbers?
The pattern is: alphanumeric, followed by /, then alphanumeric, followed by /, then alphanumeric
*Housekeeping Services For the period( 1‐03‐2020 to 31‐03‐2020) ‐ HDC ‐5i
SL.NO HSN/SAC
Code UOM
Facility
Approved
HC
Total Billing
Hours
Actual Manpower
HC
Unit Rate Per
Month Taxable Value
1 HK Supervisor 9985 HR 4 832 4.00 18,644.00 7 4,576.00*
Case no.**MH20/00285/VAS**
Case no. **MH20/00294/GVN1**
Case no. **MH20/000026/MUMR**
Case no. **KA20/00346/BN**
Case no. **DL20/0024/DLH39**
Case no. **MH20/003B30/GUR2**
Case no. **GJ20/001A75/GJ**
Case no. **GJ20/001A77/GJ**
Case no. **MH20/002CK89/GVN1**
*3,15,962.69
2 8,436.64
2 8,436.64
3,72,836.00
AMOUNT IN WORDS:‐ Rupees Three Lakhs Seventy Two Thousand Eight Hundred Thirty Six Only*
This one should do the job (note that \w already matches digits, so [\d\w] is equivalent to \w):
[\d\w]{4}/[\d\w]+/[\d\w]+
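A minimal sketch of using this with `re.findall`, written with the equivalent `\w` form and run against a few lines taken from the sample. The names `CASE_NO_RE` and `sample` are chosen for this example only.

```python
import re

# \w already covers digits, so this is equivalent to [\d\w]{4}/[\d\w]+/[\d\w]+
CASE_NO_RE = re.compile(r"\w{4}/\w+/\w+")

sample = """SL.NO HSN/SAC
Case no.**MH20/00285/VAS**
Case no. **DL20/0024/DLH39**
Case no. **MH20/002CK89/GVN1**
3,72,836.00"""

case_numbers = CASE_NO_RE.findall(sample)
print(case_numbers)
```

The pattern needs two slashes, so a header such as HSN/SAC is not picked up.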
I'm trying to write a regex for removing text within brackets, () or [], but only where the content is not a number with a percent symbol. It should also remove everything up to the farthest (outermost) closing bracket when brackets are nested.
2.1.1. Berlin (/bɜːrˈlɪn/; German: [bɛʁˈliːn] (About this soundlisten)) is the capital and largest city of Germany by both area and population.[5][6] Its 3,769,495 (2019)[2] inhabitants make it the most populous city proper of the European Union. The two cities are at the center of the Berlin-Brandenburg capital region, which is, with about six million inhabitants. By 1700, approximately 30 percent (30%) of Berlin's residents were French, because of the Huguenot immigration.[40] Many other immigrants came from Bohemia, Poland, and Salzburg.
What I have now removes everything between the brackets, but it stops at the first closing bracket instead of the farthest one.
re.sub(r"[\(\[].*?[\)\]]", "", sentence).strip()
You may remove all substrings between nested square brackets and remove all substrings inside parentheses except those with a number and a percentage symbol inside with
import re
def remove_text_nested(text, pattern):
    n = 1  # run at least once
    while n:
        text, n = re.subn(pattern, '', text)  # remove non-nested/flat balanced parts
    return text
text = "Berlin (/bɜːrˈlɪn/; German: [bɛʁˈliːn] (About this soundlisten)) is the capital and largest city of Germany by both area and population.[5][6] Its 3,769,495 (2019)[2] inhabitants make it the most populous city proper of the European Union. The two cities are at the center of the Berlin-Brandenburg capital region, which is, with about six million inhabitants. By 1700, approximately 30 percent (30%) of Berlin's residents were French, because of the Huguenot immigration.[40] Many other immigrants came from Bohemia, Poland, and Salzburg."
text = remove_text_nested(text, r'\((?!\d+%\))[^()]*\)')
text = remove_text_nested(text, r'\[[^][]*]')
print(text)
Output:
Berlin is the capital and largest city of Germany by both area and population. Its 3,769,495 inhabitants make it the most populous city proper of the European Union. The two cities are at the center of the Berlin-Brandenburg capital region, which is, with about six million inhabitants. By 1700, approximately 30 percent (30%) of Berlin's residents were French, because of the Huguenot immigration. Many other immigrants came from Bohemia, Poland, and Salzburg.
Basically, the remove_text_nested function removes all matches in a loop until no more replacements occur.
The \((?!\d+%\))[^()]*\) pattern matches (, then fails the match if 1+ digits followed by %) appear immediately to the right of the current location, then matches 0+ chars other than ( and ), and then matches ).
The \[[^][]*] pattern simply matches [, then 0 or more chars other than [ and ], and then a ].
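To make the loop visible, the following sketch records the text after every pass on a short made-up string with one nested group and one percentage. The function name `strip_passes` and the input string are illustrative assumptions; the pattern is the parentheses pattern from above.

```python
import re

PAREN_RE = re.compile(r"\((?!\d+%\))[^()]*\)")


def strip_passes(text):
    """Return the text after each pass that removed at least one innermost group."""
    passes = []
    while True:
        text, n = PAREN_RE.subn("", text)
        if n == 0:  # nothing left to remove
            return passes
        passes.append(text)


for step in strip_passes("a (b (c) d) e (30%) f"):
    print(repr(step))
```

Each pass can only remove groups that contain no further parentheses, so the inner (c) goes first, the group that enclosed it goes on the next pass, and (30%) is never touched because of the negative lookahead.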
I am relatively new to regex (always struggled with it for some reason)...
I have text that is of this form:
David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...
Mark Brookes, Non Executive Director bought 811 shares in the company on YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...
Albert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by...
Parsing the text reveals the following structure:
1. Two or more words beginning the sentence, and before the first comma, are the name of the person involved in the transaction
2. One or more words before ('sold'|'bought'|'exercised'|'sold post-exercise') are the title of the person
3. The presence of any one of these: ('sold'|'bought'|'exercised'|'sold post-exercise') AFTER the title identifies the transaction type
4. The first numeric string following the transaction type ('sold'|'bought'|'exercised'|'sold post-exercise') denotes the size of the transaction
5. 'price of ' precedes a numeric string, which specifies the price at which the deal was struck.
My question is:
How can I use this knowledge (and regex), to write a function that parses similar text to return the variables of interest (listed 1 - 5 above)?
Pseudo code for the function I want to write ..
def grok_directors_dealings_text(text_input):
    name, title, transaction_type, lot_size, price = (None, None, None, None, None)
    ....
    name = ...
    title = ...
    transaction_type = ...
    lot_size = ...
    price = ...
    pass
How would I use regex to implement the functions to return the variables of interest when passed in text that conforms to the structure I have identified above?
[[Edit]]
For some reason, I have struggled with regex for a while. If I am to learn from the correct answer here on S.O., it will be much better if an explanation is offered as to why the magical expression (sorry, regexpr) actually works.
I want to actually learn this stuff instead of copy-pasting expressions ...
You can use the following regex:
(.*?),\s(.*)\s(sold(?: post-exercise)?|bought|exercised)\s*([\d,]*).*price of\s*(\d*\.\d+?p)
DEMO
Python:
import re
financialData = """
David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...
Mark Brookes, Non Executive Director bought 811 shares in the company on YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...
Albert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by...
"""
# Raw string so \s and \d reach the regex engine; the dot in the price is escaped to match a literal '.'
print(re.findall(r'(.*?),\s(.*)\s(sold(?: post-exercise)?|bought|exercised)\s*([\d,]*).*price of\s*(\d*\.\d+?p)', financialData))
Output:
[('David Meredith', 'Financial Director', 'sold post-exercise', '15,000', '1044.00p'), ('Mark Brookes', 'Non Executive Director', 'bought', '811', '76.75p'), ('Albert Ellis', 'CEO', 'bought', '262', '52.00p')]
EDIT 1
To understand how it works and what each part means, follow the DEMO link; at the top right you can find a block explaining what each and every character means.
Also Debuggex helps you simulate the string by showing what group matches which characters!
Here's a debuggex demo for your particular case:
(.*?),\s(.*)\s(sold(?: post-exercise)?|bought|exercised)\s*([\d,]*).*price of\s*(\d*\.\d+?p)
I came up with this regex:
([\w ]+), ([\w ]+) (sold post-exercise|sold|bought|exercised) ([\d,\.]+).*price of ([\d\.,]+)p
Basically, we are using parentheses to capture the important info you want, so let's check each one:
([\w ]+): \w matches any word character [a-zA-Z0-9_]; together with the space, one or more times, this gives us the name of the person;
([\w ]+): another one of these, after a comma and a space, to get the title;
(sold post-exercise|sold|bought|exercised): then we search for our transaction types. Notice I put sold post-exercise before sold so that it tries to match the longer alternative first;
([\d,\.]+): then we try to find the numbers, which are made of digits (\d) and commas, and a dot may probably appear as well;
([\d\.,]+): then we need to get the price, which is matched basically the same way as the size of the transaction.
The parts of the regex that connect the groups are pretty basic as well.
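The point about ordering the alternatives can be checked with a tiny sketch. Alternation tries its branches from left to right and keeps the first one that succeeds, so listing sold before sold post-exercise cuts the match short. The variable names here are made up for the example.

```python
import re

text = "Financial Director sold post-exercise 15,000 shares"

# Shorter alternative listed first: the engine is satisfied with "sold".
short_first = re.search(r"(sold|sold post-exercise)", text).group(1)

# Longer alternative listed first: the full phrase is captured.
long_first = re.search(r"(sold post-exercise|sold)", text).group(1)

print(short_first)
print(long_first)
```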
If you try it on regex101 it provides some explanation about the regex and generates this code in python to use:
import re
p = re.compile(r'([\w ]+), ([\w ]+) (sold post-exercise|sold|bought|exercised) ([\d,\.]+).*price of ([\d\.,]+)p')  # the ur'' prefix is Python 2 only; r'' works in Python 3
test_str = u"David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...\n\nMark Brookes, Non Executive Director bought 811 shares in the company on YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...\n\nAlbert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by..."
re.findall(p, test_str)
You can use the following regex that just looks for characters surrounding the delimiters:
(.*?), (.*?) (sold post-exercise|bought|exercised|sold) (.*?) shares .*? price of (.*?)p
The parts in parentheses will be captured as groups.
>>> import re
>>> l = ['''David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...''', '''Mark Brookes, Non Executive Director bought 811 shares in the company on YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...''', '''Albert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by...''']
>>> for s in l:
...     print(re.findall(r'(.*?), (.*?) (sold post-exercise|bought|exercised|sold) (.*?) shares .*? price of (.*?)p', s))
...
[('David Meredith', 'Financial Director', 'sold post-exercise', '15,000', '1044.00')]
[('Mark Brookes', 'Non Executive Director', 'bought', '811', '76.75')]
[('Albert Ellis', 'CEO', 'bought', '262', '52.00')]
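For a function of the shape outlined in the question, one possible sketch uses named groups and `re.VERBOSE` so that each part of the pattern carries its own comment. This is only one way to arrange it, not the approach of any particular answer here. The group names, the choice to return a tuple, and the conversion of the size to `int` and the price to `float` are assumptions for illustration.

```python
import re

DEAL_RE = re.compile(
    r"""
    (?P<name>[^,]+),\s+             # name: everything before the first comma
    (?P<title>.+?)\s+               # title: shortest text before the transaction verb
    (?P<type>sold\ post-exercise|sold|bought|exercised)\s+  # longer alternative first
    (?P<size>[\d,]+)                # first number after the verb
    .*?price\ of\s+                 # skip ahead to the literal text 'price of'
    (?P<price>\d+(?:\.\d+)?)p       # price in pence, captured without the trailing p
    """,
    re.VERBOSE,
)


def grok_directors_dealings_text(text_input):
    """Return (name, title, transaction_type, lot_size, price) or None if no match."""
    m = DEAL_RE.match(text_input.strip())
    if m is None:
        return None
    lot_size = int(m["size"].replace(",", ""))  # '15,000' -> 15000
    return (m["name"], m["title"], m["type"], lot_size, float(m["price"]))


line = ("David Meredith, Financial Director sold post-exercise 15,000 shares "
        "in the company on YYYY-mm-dd at a price of 1044.00p.")
print(grok_directors_dealings_text(line))
```

In verbose mode, literal spaces inside the pattern have to be escaped (as in `sold\ post-exercise`), since unescaped whitespace is ignored.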
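This is the regex that will do it:
(.*?),(.*?)(sold post-exercise|sold|bought|exercised).*?([\d,]+).*?price of ([\d\.]+)
You use it like this:
import re
def get_data(line):
    # Inside a character class, | is a literal pipe, so it is left out of [\d,] and [\d\.]
    pattern = r"(.*?),(.*?)(sold post-exercise|sold|bought|exercised).*?([\d,]+).*?price of ([\d\.]+)"
    m = re.match(pattern, line)
    # re.match returns None when the line does not fit the pattern
    return m.groups() if m else None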
for the first line this will return
('David Meredith', ' Financial Director ', 'sold post-exercise', '15,000', '1044.00')
EDIT:
Adding an explanation.
This regex works as follows:
The first characters, (.*?), mean: take the string until the next match (which is the ,)
. means any character
the * means that it can repeat many times (many characters and not just 1)
? means don't be greedy, so it will stop at the first ',' and not a later one (if there are several ',')
After that there is (.*?) again,
which again takes the characters until the next thing to match (which is one of the constant words)
After that there is (sold post-exercise|sold|bought|exercised), which means: find one of the words (separated by |)
After that there is a .*?, which again means take all text until the next match (this time it is not surrounded by (), so it won't be captured as a group and won't be part of the output)
([\d,]+) means take a digit (\d) or a comma; the + stands for one or more times. No | is needed inside a character class, where it would match a literal pipe
again .*? like before
'price of ' finds the actual string 'price of '
And last, ([\d\.]+) means again take a digit or a dot, one or more times (the dot is escaped because outside a character class . means 'any character'; inside one the escape is optional)
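The difference between the lazy (.*?) and a greedy (.*) described above can be seen in a short sketch. The sample line is made up so that it contains two commas.

```python
import re

line = "David Meredith, Financial Director, sold 10 shares"

# Lazy: stop at the first comma.
lazy = re.match(r"(.*?),", line).group(1)

# Greedy: run to the end of the line, then back off to the last comma.
greedy = re.match(r"(.*),", line).group(1)

print(lazy)
print(greedy)
```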