Good regex to extract price - python

I am trying to extract price from various currency values. Here are my sample input values:
でレンタル HD(高画質) ¥ 500
で購入  HD(高画質) ¥ 2,500
Buy SD £5.99
Buy SD £14.99
HD ausleihen EUR 3,99
HD kaufen EUR 11,99
Buy Movie HD $19.99
$1,200.84
How would I get this currency value into a float, for example 19.99 ? The regex I had so far is:
re.findall(r'[\d|\,|\.]+', s)[0].replace(',', '')
But it seems insufficient. What would be a better one?

To match an amount that appears before or after a currency code or symbol anywhere in a string, you can use
(?:\b(?:USD|GBP|EUR|JPY|CHF|SEK|DKK|NOK|SGD|HKD|AUD|TWD|NZD|CNY|KRW|INR|CAD|VEF|EGP|THB|IDR|PKR|MYR|PHP|MXN|VND|CZK|HUF|PLN|TRY|ZAR|ILS|ARS|CLP|BRL|RUB|QAR|AED|COP|PEN|CNH|KWD|SAR)|[$\u00A2-\u00A5\u058F\u060B\u07FE\u07FF\u09F2\u09F3\u09FB\u0AF1\u0BF9\u0E3F\u17DB\u20A0-\u20C0\uA838\uFDFC\uFE69\uFF04\uFFE0\uFFE1\uFFE5\uFFE6\U00011FDD-\U00011FE0\U0001E2FF\U0001ECB0])\s*(\d+(?:[.,]\d+)*)|(\d+(?:[.,]\d+)*)\s*(?:(?:USD|GBP|EUR|JPY|CHF|SEK|DKK|NOK|SGD|HKD|AUD|TWD|NZD|CNY|KRW|INR|CAD|VEF|EGP|THB|IDR|PKR|MYR|PHP|MXN|VND|CZK|HUF|PLN|TRY|ZAR|ILS|ARS|CLP|BRL|RUB|QAR|AED|COP|PEN|CNH|KWD|SAR)\b|[$\u00A2-\u00A5\u058F\u060B\u07FE\u07FF\u09F2\u09F3\u09FB\u0AF1\u0BF9\u0E3F\u17DB\u20A0-\u20C0\uA838\uFDFC\uFE69\uFF04\uFFE0\uFFE1\uFFE5\uFFE6\U00011FDD-\U00011FE0\U0001E2FF\U0001ECB0])
See the regex demo. It includes USD|GBP|EUR|JPY|CHF|SEK|DKK|NOK|SGD|HKD|AUD|TWD|NZD|CNY|KRW|INR|CAD|VEF|EGP|THB|IDR|PKR|MYR|PHP|MXN|VND|CZK|HUF|PLN|TRY|ZAR|ILS|ARS|CLP|BRL|RUB|QAR|AED|COP|PEN|CNH|KWD|SAR pattern that matches most common world currencies and [$\u00A2-\u00A5\u058F\u060B\u07FE\u07FF\u09F2\u09F3\u09FB\u0AF1\u0BF9\u0E3F\u17DB\u20A0-\u20C0\uA838\uFDFC\uFE69\uFF04\uFFE0\uFFE1\uFFE5\uFFE6\U00011FDD-\U00011FE0\U0001E2FF\U0001ECB0] that matches any currency symbols (equivalent of \p{Sc} in PCRE).
In Python, you will need a bit of code to make it work as you need:
import re
texts = ['でレンタル HD(高画質) ¥ 500',
'で購入  HD(高画質) ¥ 2,500',
'Buy SD £5.99',
'Buy SD £14.99',
'HD ausleihen EUR 3,99',
'HD kaufen EUR 11,99',
'Buy Movie HD $19.99',
'$1,200.84'
]
curword = r'(?:USD|GBP|EUR|JPY|CHF|SEK|DKK|NOK|SGD|HKD|AUD|TWD|NZD|CNY|KRW|INR|CAD|VEF|EGP|THB|IDR|PKR|MYR|PHP|MXN|VND|CZK|HUF|PLN|TRY|ZAR|ILS|ARS|CLP|BRL|RUB|QAR|AED|COP|PEN|CNH|KWD|SAR)'
cursymbol = r'[$\u00A2-\u00A5\u058F\u060B\u07FE\u07FF\u09F2\u09F3\u09FB\u0AF1\u0BF9\u0E3F\u17DB\u20A0-\u20C0\uA838\uFDFC\uFE69\uFF04\uFFE0\uFFE1\uFFE5\uFFE6\U00011FDD-\U00011FE0\U0001E2FF\U0001ECB0]'
num = r'\d+(?:[.,]\d+)*'
pattern = re.compile(fr'(?:\b{curword}|{cursymbol})\s*({num})|({num})\s*(?:{curword}\b|{cursymbol})')
print(fr'(?:\b{curword}|{cursymbol})\s*({num})|({num})\s*(?:{curword}\b|{cursymbol})')
for text in texts:
    m = pattern.search(text)
    if m:
        result = m.group(1) or m.group(2)
        print(result)
See the Python demo. It prints
500
2,500
5.99
14.99
3,99
11,99
19.99
1,200.84
If you need to convert the string result to int/float, you can also capture the currency word/symbol, then normalize the decimal separator to the one you need and parse the result as an int or float.
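The conversion step could be sketched as below. This is only a heuristic, not part of the answer above: the helper name price_to_float is hypothetical, and it assumes that a final group of one or two digits after the last separator is the fractional part, while any separator followed by exactly three digits is a thousands separator. An ambiguous value such as 1.234 is therefore read as 1234.

```python
import re

def price_to_float(raw):
    """Convert a matched amount such as '1,200.84' or '3,99' to a float.

    Assumption (heuristic): separators followed by exactly three digits
    group thousands; a trailing separator followed by one or two digits
    starts the fractional part.
    """
    m = re.fullmatch(r'(\d+(?:[.,]\d{3})*)(?:[.,](\d{1,2}))?', raw.strip())
    if m is None:
        raise ValueError(f'unrecognized amount: {raw!r}')
    whole = re.sub(r'[.,]', '', m.group(1))  # drop thousands separators
    frac = m.group(2) or '0'                 # missing fraction means .0
    return float(f'{whole}.{frac}')

for s in ['500', '2,500', '5.99', '3,99', '1,200.84']:
    print(s, '->', price_to_float(s))
```

With the sample values this prints 500.0, 2500.0, 5.99, 3.99 and 1200.84. If the currency is known, choosing the separator from the currency or locale is more reliable than guessing from the digits.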

Related

skipping a match in regex

I am trying to extract some number value from a text. Skipping is done based on a matching text.
For example :
Input Text -
ABC Company Export Items 4 Bought by XYZ Amount 400.00 with GST# 36479 GST percentage is 20%.
OR
ABC Company Export Items 4 Bought by XYZ Amount 400.00 with GST Reg No. 36479 GST% is 20%.
OR
ABC Company Export Items 4 Bought by XYZ Amount 400.00 with GST Reg# 36479 GST% is 20%.
Output Text -
Amount 400.00
GST 36479
GST 20%
Main point is input text can be in any format but output text should be same. One thing that will be same is GST Number will be non-decimal number, GST percentage will be number followed by "%" symbol and amount will be in decimal form.
I tried but not able to skip the non-numeric value after GST. Please help.
What I tried :
pattern = re.compile(r"\b(?<=GST).\D(\d+)")
You can use
\bAmount\s*(?P<amount>\d+(?:\.\d+)?).*?\bGST\D*(?P<gst_id>\d+(?:\.\d+)?).*?\bGST\D*(?P<gst_prcnt>\d+(?:\.\d+)?%)
See the regex demo. Details:
\bAmount\s* - a whole word Amount and zero or more whitespaces
(?P<amount>\d+(?:\.\d+)?) - Group "amount": one or more digits and then an optional sequence of . and one or more digits
.*? - any zero or more chars other than line break chars, as few as possible
\bGST - a word GST
\D* - zero or more chars other than digits
(?P<gst_id>\d+(?:\.\d+)?) - Group "gst_id": one or more digits and then an optional sequence of . and one or more digits
.*? - any zero or more chars other than line break chars, as few as possible
\bGST\D* - a word GST and then zero or more chars other than digits
(?P<gst_prcnt>\d+(?:\.\d+)?%) - Group "gst_prcnt": one or more digits and then an optional sequence of . and one or more digits, and then a % char.
See the Python demo:
import re
pattern = r"\bAmount\s*(?P<amount>\d+(?:\.\d+)?).*?\bGST\D*(?P<gst_id>\d+(?:\.\d+)?).*?\bGST\D*(?P<gst_prcnt>\d+(?:\.\d+)?%)"
texts = ["ABC Company Export Items 4 Bought by XYZ Amount 400.00 with GST# 36479 GST percentage is 20%.",
"ABC Company Export Items 4 Bought by XYZ Amount 400.00 with GST Reg No. 36479 GST% is 20%.",
"ABC Company Export Items 4 Bought by XYZ Amount 400.00 with GST Reg# 36479 GST% is 20%."]
for text in texts:
    m = re.search(pattern, text)
    if m:
        print(m.groupdict())
Output:
{'amount': '400.00', 'gst_id': '36479', 'gst_prcnt': '20%'}
{'amount': '400.00', 'gst_id': '36479', 'gst_prcnt': '20%'}
{'amount': '400.00', 'gst_id': '36479', 'gst_prcnt': '20%'}
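To get the exact three-line output asked for in the question, the named groups can be formatted directly. A minimal sketch, reusing the same pattern; the function name summarize is hypothetical:

```python
import re

PATTERN = re.compile(
    r"\bAmount\s*(?P<amount>\d+(?:\.\d+)?)"
    r".*?\bGST\D*(?P<gst_id>\d+(?:\.\d+)?)"
    r".*?\bGST\D*(?P<gst_prcnt>\d+(?:\.\d+)?%)"
)

def summarize(text):
    """Return the three output lines, or None if the text does not match."""
    m = PATTERN.search(text)
    if m is None:
        return None
    # m['name'] is shorthand for m.group('name') (Python 3.6+)
    return [f"Amount {m['amount']}", f"GST {m['gst_id']}", f"GST {m['gst_prcnt']}"]

text = "ABC Company Export Items 4 Bought by XYZ Amount 400.00 with GST Reg No. 36479 GST% is 20%."
print("\n".join(summarize(text)))
```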

Regex replace in Python picking a specific substring

Here's what I want to happen:
input = "asdsad,200200-12964,0009,""TREASURY SETTLEMENT NON-COMPLIANCE ASSESSMENT FOR CPD2020-01-21 USD 589,037.17"" 0.00000000,1.000000"
output = "asdsad,200200-12964,0009,""TREASURY SETTLEMENT NON-COMPLIANCE ASSESSMENT FOR CPD2020-01-21 USD 589.037.17"" 0.00000000,1.000000"
How can I change the comma (,) to a dot (.) between ""...589,037.17..."" in Python using regex.
Extra: 589,037.17 => 589.037.17
I tried:
print(re.sub(r'(?<=\d),', '.', input))
But my output was:
output = "asdsad,200200-12964.0009,""TREASURY SETTLEMENT NON-COMPLIANCE ASSESSMENT FOR CPD2020-01-21 USD 589.037.17"" 0.00000000,1.000000"
First, don't call a variable input, because it shadows the built-in function input(). Also, your adjacent string literals are concatenated into a single string in Python, so the doubled quotes disappear.
i = 'asdsad,200200-12964,0009,TREASURY SETTLEMENT NON-COMPLIANCE ASSESSMENT FOR CPD2020-01-21 USD 589,037.17 0.00000000,1.000000'
To solve your specific case, you could match the currency code followed by the three digits that make up the first part of the price before the comma. That works here, but it probably isn't generic enough for any currency code and any price, as look-behinds must be of fixed width.
print(re.sub(r'(?<=USD \d{3}),', '.', i))
A more general option is a look-behind for any three-letter currency code and a space, then capture the digits before the comma and put them back with a backreference.
print(re.sub(r'(?<=[A-Z]{3} )(\d+),', r'\1.', i))
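The substitution above only touches the first comma after the currency code. If amounts can contain several thousands groups (for example 1,589,037.17), one possible extension is to match the whole comma-grouped number and rewrite it in a replacement function. This is a sketch under that assumption; the function name dot_thousands is hypothetical.

```python
import re

# A three-letter uppercase code and a space, then a comma-grouped integer part.
GROUPED = re.compile(r'\b([A-Z]{3} )(\d{1,3}(?:,\d{3})+)')

def dot_thousands(text):
    """Replace thousands commas with dots, only in numbers after a currency code."""
    return GROUPED.sub(lambda m: m.group(1) + m.group(2).replace(',', '.'), text)

i = 'asdsad,200200-12964,0009,TREASURY SETTLEMENT NON-COMPLIANCE ASSESSMENT FOR CPD2020-01-21 USD 589,037.17 0.00000000,1.000000'
print(dot_thousands(i))
print(dot_thousands('x USD 1,589,037.17 y'))  # x USD 1.589.037.17 y
```

The other commas in the line are left alone because they are not preceded by a three-letter code and a space.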
import re
input = "asdsad,200200-12964,0009,""TREASURY SETTLEMENT NON-COMPLIANCE ASSESSMENT FOR CPD2020-01-21 USD 589,037.17"" 0.00000000,1.000000"
print(input)
print(re.sub(r'USD (\d+),(\d+)', r'USD \1.\2', input))
Output:
asdsad,200200-12964,0009,TREASURY SETTLEMENT NON-COMPLIANCE ASSESSMENT FOR CPD2020-01-21 USD 589,037.17 0.00000000,1.000000
asdsad,200200-12964,0009,TREASURY SETTLEMENT NON-COMPLIANCE ASSESSMENT FOR CPD2020-01-21 USD 589.037.17 0.00000000,1.000000
See the documentation on Search and Replace with re.sub for more on this.

How to extract first floating numbers appearing after a word?

I'm trying to build an application for text extraction use case but I was not able to extract exact price from it.
I have a text like this,
string1 = 'Friscos #8603\n8100 E. Orchard Road\nGreenwood Village, Colorado 80111\n2013-11-02\nTable 00\nGuest\n1 Oysters 1/2 Shell #1\n1 Crab Cake\n1 Filet 1602 Bone In\n1 Ribeye 22oz Bone In\n1 Asparagus\n1 Potato Au Gratin\n$17.00\n$19.00\n$66.00\n$53.00\n$12.00\n$11.50\nSub Total\nTax\n$178.50\n$12.94\nTotal\n$191.44\n'
string2 = 'Berghotel\nGrosse Scheidegg\n3818 Grindelwald\nFamilie R. Müller\nRech. Nr. 4572\nBar\n30.07.2007/13:29:17\nTisch 7/01\nNM\n#ರ\n2xLatte Macchiato à 4.50 CHF\n1xGloki\nà 5.00 CHF\n1xSchweinschnitzel à 22.00 CHF\n1xChässpätzli à 18.50 CHF\n#ರ #ರ #1ರ\n5.00\n22.00\n18.50\nTotal:\nCHF\n54.50\nIncl. 7.6% MwSt\n54.50 CHF:\n3.85\nEntspricht in Euro 36.33 EUR\nEs bediente Sie: Ursula\nMwSt Nr. : 430 234\nTel.: 033 853 67 16\nFax.: 033 853 67 19\nE-mail: grossescheidegg#bluewin.ch\n'
I want to extract the price that appearing after the word total using regex but I was only able to extract all floating numbers. Also do note some-times you may also see words such as sub total but I only need price that appears after the word total. Also sometimes after total there may occur other words as well. So Regex should match word total and extract the floating numbers that appears next to it.
Any help is appreciated.
This is what I've tried,
re.findall("\d+\.\d+", string1) # this returns all floating numbers.
You can try
(?<=\nTotal):?\D+([\d.]+)
Demo
You could try this, should work for the example and the other restrictions you mentioned
import re
result = re.search(r'Total\n\$(\d+\.\d+)', string1)
result.group(1) # 191.44
result = re.search(r'Total:\n.+\n(\d+\.\d+)', string2)
result.group(1) # 54.50
EDIT: If you want only one expression for both, you could try
result = re.search(r'\nTotal:?(\n\D+)*\n\$?(\d+\.\d+)', string)
result.group(2)
You could use a negative lookbehind to make sure sub does not precede total, word boundaries to prevent the words from being part of a larger word, and a capturing group to capture the price.
(?<!\bsub )\btotal\b\D*(\d+(?:\.\d+))
In parts:
(?<!\bsub ) Negative lookbehind, assert what is on the left is not the word sub and a space
\btotal\b Match total between word boundaries to prevent it being part of a larger word
\D* Match 0+ times any char that is not a digit
( Capture group 1
\d+(?:\.\d+) Match 1+ digits followed by a dot and 1+ digits (the decimal part is required as written; append ? to the group to make it optional)
) Close group
Regex demo | Python demo
For example
import re
regex = r"(?<!\bsub )\btotal\b\D*(\d+(?:\.\d+))"
string1 = 'Friscos #8603\n8100 E. Orchard Road\nGreenwood Village, Colorado 80111\n2013-11-02\nTable 00\nGuest\n1 Oysters 1/2 Shell #1\n1 Crab Cake\n1 Filet 1602 Bone In\n1 Ribeye 22oz Bone In\n1 Asparagus\n1 Potato Au Gratin\n$17.00\n$19.00\n$66.00\n$53.00\n$12.00\n$11.50\nSub Total\nTax\n$178.50\n$12.94\nTotal\n$191.44\n'
string2 = 'Berghotel\nGrosse Scheidegg\n3818 Grindelwald\nFamilie R. Müller\nRech. Nr. 4572\nBar\n30.07.2007/13:29:17\nTisch 7/01\nNM\n#ರ\n2xLatte Macchiato à 4.50 CHF\n1xGloki\nà 5.00 CHF\n1xSchweinschnitzel à 22.00 CHF\n1xChässpätzli à 18.50 CHF\n#ರ #ರ #1ರ\n5.00\n22.00\n18.50\nTotal:\nCHF\n54.50\nIncl. 7.6% MwSt\n54.50 CHF:\n3.85\nEntspricht in Euro 36.33 EUR\nEs bediente Sie: Ursula\nMwSt Nr. : 430 234\nTel.: 033 853 67 16\nFax.: 033 853 67 19\nE-mail: grossescheidegg#bluewin.ch\n'
print(re.findall(regex, string1, re.IGNORECASE))
print(re.findall(regex, string2, re.IGNORECASE))
Output
['191.44']
['54.50']
If what precedes the price should be a dollar sign or the text CHF, you might use an alternation (?:\$|CHF)\s* matching either of the values, followed by 0+ whitespace chars:
(?<!\bsub )\btotal\b\D*(?:\$|CHF)\s*(\d+(?:\.\d+))
Regex demo
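If the goal is a number rather than a string, the first pattern from this answer can be wrapped in a small helper. A sketch only; the name extract_total is hypothetical, and the short sample strings are cut-down versions of the receipts above:

```python
import re

TOTAL_RE = re.compile(r"(?<!\bsub )\btotal\b\D*(\d+(?:\.\d+))", re.IGNORECASE)

def extract_total(text):
    """Return the first price after a standalone 'total' as a float, or None."""
    m = TOTAL_RE.search(text)
    return float(m.group(1)) if m else None

print(extract_total('Sub Total\nTax\n$178.50\n$12.94\nTotal\n$191.44\n'))  # 191.44
print(extract_total('Total:\nCHF\n54.50\nIncl. 7.6% MwSt\n'))              # 54.5
```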
Something like this might do the trick:
(?<!sub )total.*?(\d+\.\d+)
Make sure to ignore the case, and enable the DOTALL flag so that . also matches the newlines between total and the number.

Filtering a number from different formats with Regex

I am trying to do some data analysis and there are some numbers that I want to analyze, the problem being that those numbers are in different string formats. These are the different formats:
"25,000,000 USD" or
"9 500 USD" or
"50,000 ETH"
It is basically always a number first, separated by either commas or blank spaces followed by the currency. Depending on the currency, i want to calculate the amount in USD afterwards.
I have looked up Regex expressions for the last hour and could not find anything that solves my problem. I definitely made some progress and implemented different expressions, but none worked 100%. It's always missing something as you will see below.
for i, row_value in df2['hardcap'].iteritems():
    try:
        q = df2['hardcap'][i]
        c = re.findall(r'[a-zA-Z]+', q)
        if c[0] == "USD":
            d = re.findall(r'^(\d?\d?\d(,\d\d\d)*|\d)', q)
            #Do something with the number
        elif c[0] == "EUR":
            d = re.findall(r'^(\d?\d?\d(,\d\d\d)*|\d)', q)
            #Do something with the number
        elif c[0] == "ETH":
            d = re.findall(r'^(\d?\d?\d(,\d\d\d)*|\d)', q)
            #Do something with the number
        print(d[0])
    except Exception:
        pass
So I am iterating through my dataframe column. First, I find out which currency the number relates to, either "USD", "EUR" or "ETH", which I save in c. This part already works. After that, I want to extract the number in a form that can be converted to an integer so I can do calculations with it.
Right now, the line
d = re.findall(r'^(\d?\d?\d(,\d\d\d)*|\d)', q)
returns something like this in d[0]:
('100,000,000', ',000') if the number was 100,000,000 and
('270', '') if the number was 270 000 000
What I would like to get in the best case would be something like:
100000000
and
270000000, but any way to extract the whole numbers would suffice
I'd appreciate any bump in the right direction as I don't have much experience with regex and feel stuck right now.
import re
s = '25,000,000 USD 9 500 USD 50,000 ETH'
for g in re.findall(r'(.*?)([A-Z]{3})', s):
    print(int(''.join(re.findall(r'\d', g[0]))), g[1])
Prints:
25000000 USD
9500 USD
50000 ETH
Optimized solution with re.search + re.sub functions:
import re
# equivalent for your df2['hardcap'] column values
hardcap = ["25,000,000 USD", "9 500 USD", "50,000 ETH"]
pat = re.compile(r'^(\d[\s,\d]*\d) ([A-Z]{3})')
for v in hardcap:
    m = pat.search(v)
    if m: # if value is in the needed format
        amount, currency = m.group(1), m.group(2)
        amount = int(re.sub(r'\D*', '', amount))
        print(amount, currency)
Sample output:
25000000 USD
9500 USD
50000 ETH
import re
s = '25,000,000 USD 9 500 USD 50,000 ETH'
matches = re.findall(r'(\d[\d, ]*) ([A-Z]{3})', s)
l = [(int(match[0].replace(',', '').replace(' ', '')), match[1]) for match in matches]
print(l)
[(25000000, 'USD'), (9500, 'USD'), (50000, 'ETH')]
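Since the question also mentions converting each amount to USD afterwards, the parsing step can be combined with a rate lookup. This is a minimal sketch: the function names are hypothetical, and the exchange rates in RATES_TO_USD are made-up placeholder values, not real data.

```python
import re

AMOUNT_RE = re.compile(r'^(\d[\d, ]*) ([A-Z]{3})$')

# Hypothetical exchange rates, for illustration only.
RATES_TO_USD = {'USD': 1.0, 'EUR': 1.1, 'ETH': 2000.0}

def parse_hardcap(value):
    """Split '9 500 USD' into (9500, 'USD'); return None if the format differs."""
    m = AMOUNT_RE.match(value.strip())
    if m is None:
        return None
    amount = int(re.sub(r'[ ,]', '', m.group(1)))  # drop grouping chars
    return amount, m.group(2)

def to_usd(value, rates=RATES_TO_USD):
    """Convert a hardcap string to USD using the given rate table."""
    parsed = parse_hardcap(value)
    if parsed is None or parsed[1] not in rates:
        return None
    amount, currency = parsed
    return amount * rates[currency]

for v in ["25,000,000 USD", "9 500 USD", "50,000 ETH"]:
    print(v, '->', parse_hardcap(v), to_usd(v))
```

On a DataFrame, the same function could be applied to the column with df2['hardcap'].apply(parse_hardcap) instead of looping by index.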

Python regex to parse financial data

I am relatively new to regex (always struggled with it for some reason)...
I have text that is of this form:
David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...
Mark Brookes, Non Executive Director bought 811 shares in the company on YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...
Albert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by...
Parsing the text reveals the following structure:
1. Two or more words beginning the sentence, before the first comma, are the name of the person involved in the transaction
2. One or more words before ('sold'|'bought'|'exercised'|'sold post-exercise') are the title of the person
3. The presence of one of these: ('sold'|'bought'|'exercised'|'sold post-exercise') AFTER the title identifies the transaction type
4. The first numeric string following the transaction type ('sold'|'bought'|'exercised'|'sold post-exercise') denotes the size of the transaction
5. 'price of ' precedes a numeric string, which specifies the price at which the deal was struck.
My question is:
How can I use this knowledge (and regex), to write a function that parses similar text to return the variables of interest (listed 1 - 5 above)?
Pseudo code for the function I want to write ..
def grok_directors_dealings_text(text_input):
    name, title, transaction_type, lot_size, price = (None, None, None, None, None)
    ....
    name = ...
    title = ...
    transaction_type = ...
    lot_size = ...
    price = ...
    pass
How would I use regex to implement the functions to return the variables of interest when passed in text that conforms to the structure I have identified above?
[[Edit]]
For some reason, I have seemed to struggle with regex for a while, if I am to learn from the correct answer here on S.O, it will be much better, if an explanation is offered as to why the magical expression (sorry, regexpr) actually works.
I want to actually learn this stuff instead of copy pasting expressions ...
You can use the following regex:
(.*?),\s(.*)\s(sold(?: post-exercise)?|bought|exercised)\s*([\d,]*).*price of\s*(\d*.\d+?p)
DEMO
Python:
import re
financialData = """
David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...
Mark Brookes, Non Executive Director bought 811 shares in the company on YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...
Albert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by...
"""
print(re.findall(r'(.*?),\s(.*)\s(sold(?: post-exercise)?|bought|exercised)\s*([\d,]*).*price of\s*(\d*.\d+?p)', financialData))
Output:
[('David Meredith', 'Financial Director', 'sold post-exercise', '15,000', '1044.00p'), ('Mark Brookes', 'Non Executive Director', 'bought', '811', '76.75p'), ('Albert Ellis', 'CEO', 'bought', '262', '52.00p')]
EDIT 1
To understand what each part means, follow the DEMO link; the panel at the top right explains every token of the regex.
Also Debuggex helps you simulate the string by showing what group matches which characters!
Here's a debuggex demo for your particular case:
(.*?),\s(.*)\s(sold(?: post-exercise)?|bought|exercised)\s*([\d,]*).*price of\s*(\d*.\d+?p)
Debuggex Demo
I came up with this regex:
([\w ]+), ([\w ]+) (sold post-exercise|sold|bought|exercised) ([\d,\.]+).*price of ([\d\.,]+)p
Debuggex Demo
Basically, we are using the parenthesis to capture the important info you want so let's check it out each one:
([\w ]+): \w matches any word character [a-zA-Z0-9_] one or more times, this will give us the name of the person;
([\w ]+) Another one of these after a comma and a space to get the title;
(sold post-exercise|sold|bought|exercised) then we search for our transaction types. Notice I put sold post-exercise before sold so that it tries to match the longer alternative first;
([\d,\.]+) Then we try to find the numbers, which are made of digits (\d) and commas, and a dot may appear as well;
([\d\.,]+) Then we need to get the price, which is matched the same way as the size of the transaction.
The parts of the regex that connect the groups are pretty basic as well.
If you try it on regex101 it provides some explanation about the regex and generates this code in python to use:
import re
p = re.compile(r'([\w ]+), ([\w ]+) (sold post-exercise|sold|bought|exercised) ([\d,\.]+).*price of ([\d\.,]+)p')
test_str = u"David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...\n\nMark Brookes, Non Executive Director bought 811 shares in the company on YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...\n\nAlbert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by..."
re.findall(p, test_str)
You can use the following regex that just looks for characters surrounding the delimiters:
(.*?), (.*?) (sold post-exercise|bought|exercised|sold) (.*?) shares .*? price of (.*?)p
The parts in parentheses will be captured as groups.
>>> import re
>>> l = ['''David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...''', '''Mark Brookes, Non Executive Director bought 811 shares in the company on YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...''', '''Albert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by...''']
>>> for s in l:
...     print(re.findall(r'(.*?), (.*?) (sold post-exercise|bought|exercised|sold) (.*?) shares .*? price of (.*?)p', s))
...
[('David Meredith', 'Financial Director', 'sold post-exercise', '15,000', '1044.00')]
[('Mark Brookes', 'Non Executive Director', 'bought', '811', '76.75')]
[('Albert Ellis', 'CEO', 'bought', '262', '52.00')]
this is the regex that will do it
(.*?),(.*?)(sold post-exercise|sold|bought|exercised).*?([\d|,]+).*?price of ([\d|\.]+)
you use it like this
import re
def get_data(line):
    pattern = r"(.*?),(.*?)(sold post-exercise|sold|bought|exercised).*?([\d|,]+).*?price of ([\d|\.]+)"
    m = re.match(pattern, line)
    return m.groups()
for the first line this will return
('David Meredith', ' Financial Director ', 'sold post-exercise', '15,000', '1044.00')
EDIT:
adding explanation
This regex works as follows:
the first part, (.*?), means: take the string up to the next match (which is the ,)
. means any character
the * means that it can repeat many times (many characters and not just 1)
? means don't be greedy, so it stops at the first ',' rather than a later one (if there are several ',')
after that there is (.*?) again
again, take the characters up to the next thing to match (which is one of the constant words)
after that there is (sold post-exercise|sold|bought|exercised), which means: find one of the words (separated by |)
after that there is a .*?, which again means take all text until the next match (this time it is not surrounded by (), so it won't be captured as a group and won't be part of the output)
([\d|,]+) means take a digit (\d) or a comma; the + stands for one or more times (inside [] the | is a literal pipe character, not alternation, so [\d,]+ would be enough)
again .*? like before
'price of ' finds the actual string 'price of '
and last, ([\d|\.]+) means again take a digit or a dot (escaped because outside a character class . means 'any character'), one or more times
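Putting the pieces together, the function sketched in the question could look roughly like this. It is one possible implementation rather than a definitive one: the group names follow the variable names in the question, the regex is a variation on the patterns above, and it assumes the price is always given in pence with a trailing p.

```python
import re

DEALING_RE = re.compile(
    r'(?P<name>[^,]+),\s*'                 # everything before the first comma
    r'(?P<title>.+?)\s+'                   # lazily, up to the transaction type
    r'(?P<transaction_type>sold post-exercise|sold|bought|exercised)\s+'
    r'(?P<lot_size>\d[\d,]*)\s+shares\b'   # first number after the type
    r'.*?price of\s+(?P<price>\d+(?:\.\d+)?)p'
)

def grok_directors_dealings_text(text_input):
    """Return a dict with the five fields, or None if the text does not match."""
    m = DEALING_RE.search(text_input)
    if m is None:
        return None
    return {
        'name': m['name'].strip(),
        'title': m['title'],
        'transaction_type': m['transaction_type'],
        'lot_size': int(m['lot_size'].replace(',', '')),  # '15,000' -> 15000
        'price': float(m['price']),                       # pence, as in the text
    }

line = ("David Meredith, Financial Director sold post-exercise 15,000 shares in the "
        "company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares")
print(grok_directors_dealings_text(line))
```

Listing sold post-exercise before sold matters here: alternatives are tried left to right, so the longer phrase has to come first.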
