Regex replace in Python picking a specific substring - python

Here's what I want to happen:
input = "asdsad,200200-12964,0009,""TREASURY SETTLEMENT NON-COMPLIANCE ASSESSMENT FOR CPD2020-01-21 USD 589,037.17"" 0.00000000,1.000000"
output = "asdsad,200200-12964,0009,""TREASURY SETTLEMENT NON-COMPLIANCE ASSESSMENT FOR CPD2020-01-21 USD 589.037.17"" 0.00000000,1.000000"
How can I change the comma (,) to a dot (.) inside the amount 589,037.17 in Python using regex?
Extra: 589,037.17 => 589.037.17
I tried:
print(re.sub(r'(?<=\d),', '.', input))
But my output was:
output = "asdsad,200200-12964.0009,""TREASURY SETTLEMENT NON-COMPLIANCE ASSESSMENT FOR CPD2020-01-21 USD 589.037.17"" 0.00000000,1.000000"

First, don't call a variable input, because it shadows the built-in function input(). Also, your adjacent quoted strings are concatenated into just one string in Python, so the doubled quotes disappear.
i = 'asdsad,200200-12964,0009,TREASURY SETTLEMENT NON-COMPLIANCE ASSESSMENT FOR CPD2020-01-21 USD 589,037.17 0.00000000,1.000000'
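A quick way to see the point about quotes, as a minimal sketch (the names joined and quoted are only for illustration): Python joins adjacent string literals, so a doubled quote simply closes one literal and opens the next.

```python
# Adjacent string literals are concatenated, so "" inside the
# question's literal closes one piece and opens the next one.
joined = "USD 589,037.17"" 0.00000000"
print(joined)  # USD 589,037.17 0.00000000

# To keep real double quotes in the data, quote the literal differently.
quoted = 'USD "589,037.17" 0.00000000'
print(quoted)  # USD "589,037.17" 0.00000000
```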
To solve your specific case, you could match the currency code followed by three digits, i.e. the first part of the price before the comma. That works here, but probably isn't generic enough for any currency code and any price, because look-behinds must be of fixed width.
print(re.sub(r'(?<=USD \d{3}),', '.', i))
For something more generic, I would use a look-behind for the currency code and the space, then group the first part of the number and put it back with a backreference.
print(re.sub(r'(?<=[A-Z]{3} )(\d+),', r'\1.', i))
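If the amount can contain more than one thousands separator, a replacement function is one option. This is only a sketch: the helper name dots_in_amount is made up here, and it assumes the amount always follows a three-letter uppercase code and a single space.

```python
import re

# Assumption: amounts look like "USD 1,589,037.17" (code, space, digits).
AMOUNT = re.compile(r'(?<=[A-Z]{3} )\d+(?:,\d+)*')

def dots_in_amount(text):
    """Replace every comma inside the matched amount with a dot."""
    return AMOUNT.sub(lambda m: m.group().replace(',', '.'), text)

i = ('asdsad,200200-12964,0009,TREASURY SETTLEMENT NON-COMPLIANCE '
     'ASSESSMENT FOR CPD2020-01-21 USD 589,037.17 0.00000000,1.000000')
print(dots_in_amount(i))
```

Note that the pattern keeps consuming comma-separated digit groups, so it would also swallow a following numeric CSV field if the amount were not terminated by something else (here the ".17").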

import re
input = "asdsad,200200-12964,0009,""TREASURY SETTLEMENT NON-COMPLIANCE ASSESSMENT FOR CPD2020-01-21 USD 589,037.17"" 0.00000000,1.000000"
print(input)
print(re.sub(r'USD (\d+),(\d+)', r'USD \1.\2', input))
Output:
asdsad,200200-12964,0009,TREASURY SETTLEMENT NON-COMPLIANCE ASSESSMENT FOR CPD2020-01-21 USD 589,037.17 0.00000000,1.000000
asdsad,200200-12964,0009,TREASURY SETTLEMENT NON-COMPLIANCE ASSESSMENT FOR CPD2020-01-21 USD 589.037.17 0.00000000,1.000000
You can go through the Search and Replace section of the Python regex documentation for more on this.

Related

Returning empty string for missing capture group Python regex

I'm working on parsing string text containing information on university, year, degree field, and whether or not a person graduated. Here are two examples:
ex1 = 'BYU: 1990 Bachelor of Arts Theater (Graduated):BYU: 1990 Bachelor of Science Mathematics (Graduated):UNIVERSITY OF VIRGINIA: 1995 Master of Science Mechanical Engineering (Graduated):MICHIGAN STATE UNIVERSITY: 2008 Master of Fine Arts INDUSTRIAL DESIGN (Graduated)'
ex2 = 'UCSD: 2001 Bachelor of Arts English:UCLA: 2005 Bachelor of Science Economics (Graduated):UCSD 2010 Master of Science Economics'
What I am struggling to accomplish is to have an entry for each school experience regardless of whether specific information is missing. In particular, imagine I wanted to pull whether each degree was finished from ex1 and ex2 above. When I try to use re.findall I end up with something like the following for ex1:
# Code:
re.findall('[A-Z ]+: \d+ Bachelor [^:]+\((Graduated)', ex1)
# Output:
['Graduated', 'Graduated']
which is what I want, two entries for two Bachelor's degrees. For ex2, however, one of the Bachelor's degrees was unfinished so the text does not contain "(Graduated)", so the output is the following:
# Code:
re.findall('[A-Z ]+: \d+ Bachelor [^:]+\((Graduated)', ex2)
# Output:
['Graduated']
# Desired Output:
['', 'Graduated']
I have tried making the capture group optional or including the colon after graduated and am not making much headway. The example I am using is the "Graduated" information, but in principle the more general question remains if there is an identifiable degree but it is missing one or two pieces of information (like graduation year or university). Ultimately I am just looking to have complete information on each degree, including whether certain pieces of information are missing. Thank you for any help you can provide!
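You can use the ? quantifier to make "Graduated" (and the opening parenthesis) optional, so each is matched zero or one time. A group that does not take part in a match is returned by re.findall as an empty string.
re.findall(r'[A-Z ]+: \d+ Bachelor [^:()]*\(?(Graduated)?', ex2)
Output:
>>> re.findall(r'[A-Z ]+: \d+ Bachelor [^:()]*\(?(Graduated)?', ex2)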
['', 'Graduated']
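For the more general case raised in the question (one record per degree, with blanks for missing pieces), one possible sketch uses named groups with optional parts. The name parse_degrees and the exact pattern are illustrative assumptions: the pattern expects a four-digit year and the words Bachelor or Master, and only the colon after the school and the (Graduated) marker are optional.

```python
import re

# Assumed entry layout: SCHOOL[:] YEAR Bachelor|Master of FIELD [(Graduated)]
ENTRY = re.compile(
    r'(?P<school>[A-Z ]+?):? (?P<year>\d{4}) (?P<degree>Bachelor|Master) of '
    r'(?P<field>[^:()]+?)(?: \((?P<status>Graduated)\))?(?=:|$)'
)

def parse_degrees(text):
    """Return one dict per degree; groups that did not match become ''."""
    return [{k: v or '' for k, v in m.groupdict().items()}
            for m in ENTRY.finditer(text)]

ex2 = ('UCSD: 2001 Bachelor of Arts English:UCLA: 2005 Bachelor of Science '
       'Economics (Graduated):UCSD 2010 Master of Science Economics')
for row in parse_degrees(ex2):
    print(row)
```

Other pieces (for example the year) could be made optional in the same way, by wrapping them in a group followed by ?.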
How about this?
[re.sub(r'[(:)]', '', t) for t in [re.sub(r'^[^(]+', '', s) for s in re.findall(r'[A-Z ]+: \d+ Bachelor [^:]+:', ex1)]]
# output ['Graduated', 'Graduated']
[re.sub(r'[(:)]', '', t) for t in [re.sub(r'^[^(]+', '', s) for s in re.findall(r'[A-Z ]+: \d+ Bachelor [^:]+:', ex2)]]
# output ['', 'Graduated']

Add a single space and comma between words that are connected using regex

I have a nested list_3 which looks like:
[['Company OverviewCompany: HowSector: SoftwareYear Founded: 2010One Sentence Pitch: Easily give and request low-quality feedback with your team to achieve more togetherUniversity Affiliation(s): Duke$ Raised: $240,000Investors: Friends & familyTraction to Date: 10% of monthly active users (MAU) are also active weekly'], [['Company OverviewCompany: GrubSector: SoftwareYear Founded: 2018One Sentence Pitch: Find food you likeUniversity Affiliation(s): Stanford$ Raised: $340,000Investors: Friends & familyTraction to Date: 40% of monthly active users (MAU) are also active weekly']]]
I would like to use regex to add a comma followed by a single space between each pair of joined words (i.e. HowSector:, SoftwareYear, 2010One). So far I have tried to write a re.sub call to do this by selecting all the characters without whitespace and replacing them, but I have run into some issues:
for i, list in enumerate(list_3):
    list_3[i] = [re.sub('r\s\s+', ', ', word) for word in list]
    list_33.append(list_3[i])
print(list_33)
error:
return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object
I would like the output to be:
[['Company Overview, Company: How, Sector: Software, Year Founded: 2010, One Sentence Pitch: Easily give and request low-quality feedback with your team to achieve more together University, Affiliation(s): Duke, $ Raised: $240,000, Investors: Friends & family, Traction to Date: 10% of monthly active users (MAU) are also active weekly'],[...]]
Any ideas how I can use regex to do this?
The main problem is that your nested list does not have a constant depth. Sometimes it has 2 levels and sometimes it has 3 levels. This is why you are getting the above error: where the list has 3 levels, re.sub receives a list as the third argument instead of a string.
The second problem is that the regex you are using is not correct. The most naive regex we can use here should (at the very least) find a non-whitespace character followed by a capital letter.
In the example code below, I'm using re.compile (since the same regex will be used over and over again, we might as well pre-compile it and gain a small performance boost) and I'm just printing the output. You'll need to figure out a way to get the output in the format you want.
import re
# nested_list stands for the list_3 from the question
regex = re.compile(r'(\S)([A-Z])')
replacement = r'\1, \2'
for inner_list in nested_list:
    for string_or_list in inner_list:
        if isinstance(string_or_list, str):
            print(regex.sub(replacement, string_or_list))
        else:
            for string in string_or_list:
                print(regex.sub(replacement, string))
Outputs
Company Overview, Company: How, Sector: Software, Year Founded: 2010, One Sentence Pitch: Easily give and request low-quality feedback with your team to achieve more together, University Affiliation(s): Duke$ Raised: $240,000, Investors: Friends & family, Traction to Date: 10% of monthly active users (, MA, U) are also active weekly
Company Overview, Company: Grub, Sector: Software, Year Founded: 2018, One Sentence Pitch: Find food you like, University Affiliation(s): Stanford$ Raised: $340,000, Investors: Friends & family, Traction to Date: 40% of monthly active users (, MA, U) are also active weekly
I believe you can use the following Python code.
rgx = r'(?<=[a-z\d])([A-Z$][A-Za-z]*(?: +\S+?)*):'
rep = r', \1:'
re.sub(rgx, rep, s)
where s is the string.
Python's regex engine performs the following operations when matching.
(?<=            : begin positive lookbehind
  [a-z\d]       : match a lowercase letter or a digit
)               : end positive lookbehind
(               : begin capture group 1
  [A-Z$]        : match a capital letter or '$'
  [A-Za-z]*     : match 0+ letters
  (?: +\S+?)    : match 1+ spaces greedily, then 1+ non-spaces non-greedily, in a non-capture group
  *             : execute the non-capture group 0+ times
)               : end capture group 1
:               : match ':'
Note that the positive lookbehind and permissible characters for each token in the capture group may need to be adjusted to suit requirements.
The replacement template (, \1:) produces the string ', ' followed by the contents of capture group 1 followed by a colon.
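As a rough check of the breakdown above, the sketch below applies the pattern exactly as the breakdown describes it (capture group 1 followed directly by ':') to one shortened sample string. The name add_separators is hypothetical.

```python
import re

# Pattern as described in the breakdown: lookbehind, capture group 1, ':'.
RGX = re.compile(r'(?<=[a-z\d])([A-Z$][A-Za-z]*(?: +\S+?)*):')
REP = r', \1:'

def add_separators(s):
    """Insert ', ' in front of every label that is glued to the previous word."""
    return RGX.sub(REP, s)

s = ('Company OverviewCompany: GrubSector: SoftwareYear Founded: 2018'
     'One Sentence Pitch: Find food you likeUniversity Affiliation(s): '
     'Stanford$ Raised: $340,000Investors: Friends & family')
print(add_separators(s))
```

Because a label must start with a capital letter or '$' right after a lowercase letter or digit, text such as (MAU) is left alone.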
If your list of lists is arbitrarily deep, you can traverse it recursively, process the strings with a regex, and yield the same structure:
import re
from collections.abc import Iterable
def process(l):
    for el in l:
        if isinstance(el, Iterable) and not isinstance(el, (str, bytes)):
            yield type(el)(process(el))
        else:
            yield ', '.join(re.split(r'(?<=[a-z])(?=[A-Z])', el))
Given your example as LoL here is the result:
>>> list(process(LoL))
[['Company Overview, Company: How, Sector: Software, Year Founded: 2010One Sentence Pitch: Easily give and request low-quality feedback with your team to achieve more together, University Affiliation(s): Duke$ Raised: $240,000Investors: Friends & family, Traction to Date: 10% of monthly active users (MAU) are also active weekly'], [['Company Overview, Company: Grub, Sector: Software, Year Founded: 2018One Sentence Pitch: Find food you like, University Affiliation(s): Stanford$ Raised: $340,000Investors: Friends & family, Traction to Date: 40% of monthly active users (MAU) are also active weekly']]]

Can't refrain my script from grabbing unnecessary lines

I've written a script in Python to get certain lines from a block of text. I used the re module to do the job. However, it is giving me unnecessary output along with the required lines.
How can I modify my expression so that it sticks to the lines I want to grab?
This is my try:
import re
content = """
A Gross exaggeration,
-- Gross 5 90,630,08,
Gross 4 13,360,023,
Gross 2 70,940,02,
Luke gross is an actor
"""
for item in re.finditer(r'Gross(?:[\d\s,]*)',content):
    print(item.group().strip())
Output I'm having:
Gross
Gross 5 90,630,08,
Gross 4 13,360,023,
Gross 2 70,940,02,
Output I wish to have:
Gross 4 13,360,023
Gross 2 70,940,02
I changed the regex string to r'(?:^\s*?)Gross[\d\s,]*?(?=,$)' and added the multiline flag:
import re
content = """
A Gross exaggeration,
-- Gross 5 90,630,08,
Gross 4 13,360,023,
Gross 2 70,940,02,
Luke gross is an actor
"""
for item in re.finditer(r'(?:^\s*?)Gross[\d\s,]*?(?=,$)',content, flags=re.M):
    print(item.group().strip())
Output is:
Gross 4 13,360,023
Gross 2 70,940,02
^\s*Gross[\d ,]*(?=,) will capture what you want, provided the multiline flag (re.M) is set so that ^ matches at the start of every line.
I just tacked on ^ to signal the start of the line, used \s* to allow optional whitespace before "Gross", and trimmed the , from the end. I also removed \s from your character class because it also matches newlines, and replaced it with a plain space.
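A minimal sketch of that pattern in use, reusing the sample text from the question. It assumes the re.M flag, since without it ^ would only match at the very start of the string; the name matches is just for illustration.

```python
import re

content = """
A Gross exaggeration,
-- Gross 5 90,630,08,
Gross 4 13,360,023,
Gross 2 70,940,02,
Luke gross is an actor
"""

# re.M makes ^ match at the start of every line, not only the string.
pattern = re.compile(r'^\s*Gross[\d ,]*(?=,)', flags=re.M)
matches = [m.group().strip() for m in pattern.finditer(content)]
print(matches)
```

The lookahead (?=,) forces the greedy character class to give back the trailing comma, which is why it does not appear in the results.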
Demo

Good regex to extract price

I am trying to extract price from various currency values. Here are my sample input values:
でレンタル HD(高画質) ¥ 500
で購入  HD(高画質) ¥ 2,500
Buy SD £5.99
Buy SD £14.99
HD ausleihen EUR 3,99
HD kaufen EUR 11,99
Buy Movie HD $19.99
$1,200.84
How would I get this currency value into a float, for example 19.99 ? The regex I had so far is:
re.findall(r'[\d|\,|\.]+', s)[0].replace(',', '')
But it seems insufficient. What would be a better one?
To match an amount that appears before or after a currency word or symbol anywhere in a string, you may use
(?:\b(?:USD|GBP|EUR|JPY|CHF|SEK|DKK|NOK|SGD|HKD|AUD|TWD|NZD|CNY|KRW|INR|CAD|VEF|EGP|THB|IDR|PKR|MYR|PHP|MXN|VND|CZK|HUF|PLN|TRY|ZAR|ILS|ARS|CLP|BRL|RUB|QAR|AED|COP|PEN|CNH|KWD|SAR)|[$\u00A2-\u00A5\u058F\u060B\u07FE\u07FF\u09F2\u09F3\u09FB\u0AF1\u0BF9\u0E3F\u17DB\u20A0-\u20C0\uA838\uFDFC\uFE69\uFF04\uFFE0\uFFE1\uFFE5\uFFE6\U00011FDD-\U00011FE0\U0001E2FF\U0001ECB0])\s*(\d+(?:[.,]\d+)*)|(\d+(?:[.,]\d+)*)\s*(?:(?:USD|GBP|EUR|JPY|CHF|SEK|DKK|NOK|SGD|HKD|AUD|TWD|NZD|CNY|KRW|INR|CAD|VEF|EGP|THB|IDR|PKR|MYR|PHP|MXN|VND|CZK|HUF|PLN|TRY|ZAR|ILS|ARS|CLP|BRL|RUB|QAR|AED|COP|PEN|CNH|KWD|SAR)\b|[$\u00A2-\u00A5\u058F\u060B\u07FE\u07FF\u09F2\u09F3\u09FB\u0AF1\u0BF9\u0E3F\u17DB\u20A0-\u20C0\uA838\uFDFC\uFE69\uFF04\uFFE0\uFFE1\uFFE5\uFFE6\U00011FDD-\U00011FE0\U0001E2FF\U0001ECB0])
See the regex demo. It includes USD|GBP|EUR|JPY|CHF|SEK|DKK|NOK|SGD|HKD|AUD|TWD|NZD|CNY|KRW|INR|CAD|VEF|EGP|THB|IDR|PKR|MYR|PHP|MXN|VND|CZK|HUF|PLN|TRY|ZAR|ILS|ARS|CLP|BRL|RUB|QAR|AED|COP|PEN|CNH|KWD|SAR pattern that matches most common world currencies and [$\u00A2-\u00A5\u058F\u060B\u07FE\u07FF\u09F2\u09F3\u09FB\u0AF1\u0BF9\u0E3F\u17DB\u20A0-\u20C0\uA838\uFDFC\uFE69\uFF04\uFFE0\uFFE1\uFFE5\uFFE6\U00011FDD-\U00011FE0\U0001E2FF\U0001ECB0] that matches any currency symbols (equivalent of \p{Sc} in PCRE).
In Python, you will need a bit of code to make it work as you need:
import re
texts = ['でレンタル HD(高画質) ¥ 500',
'で購入  HD(高画質) ¥ 2,500',
'Buy SD £5.99',
'Buy SD £14.99',
'HD ausleihen EUR 3,99',
'HD kaufen EUR 11,99',
'Buy Movie HD $19.99',
'$1,200.84'
]
curword = r'(?:USD|GBP|EUR|JPY|CHF|SEK|DKK|NOK|SGD|HKD|AUD|TWD|NZD|CNY|KRW|INR|CAD|VEF|EGP|THB|IDR|PKR|MYR|PHP|MXN|VND|CZK|HUF|PLN|TRY|ZAR|ILS|ARS|CLP|BRL|RUB|QAR|AED|COP|PEN|CNH|KWD|SAR)'
cursymbol = r'[$\u00A2-\u00A5\u058F\u060B\u07FE\u07FF\u09F2\u09F3\u09FB\u0AF1\u0BF9\u0E3F\u17DB\u20A0-\u20C0\uA838\uFDFC\uFE69\uFF04\uFFE0\uFFE1\uFFE5\uFFE6\U00011FDD-\U00011FE0\U0001E2FF\U0001ECB0]'
num = r'\d+(?:[.,]\d+)*'
pattern = re.compile(fr'(?:\b{curword}|{cursymbol})\s*({num})|({num})\s*(?:{curword}\b|{cursymbol})')
print(fr'(?:\b{curword}|{cursymbol})\s*({num})|({num})\s*(?:{curword}\b|{cursymbol})')
for text in texts:
    m = pattern.search(text)
    if m:
        result = m.group(1) or m.group(2)
        print(result)
See the Python demo. It prints
500
2,500
5.99
14.99
3,99
11,99
19.99
1,200.84
If you need to convert string result to int/float, you can also capture the country currency word/symbol, then convert the decimal separator to the one you need and then parse to int or float.
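One way to do the conversion without looking at the currency at all is a separator heuristic. This is a sketch under a loud assumption: a lone separator followed by exactly three digits is treated as a thousands separator, so an input like '1,234' meaning 1.234 would be misread. The name amount_to_float is hypothetical.

```python
import re

def amount_to_float(amount):
    """Convert a captured amount such as '1,200.84' or '3,99' to a float.

    Assumption: if only one kind of separator occurs and it is followed
    by exactly three digits, it is a thousands separator.
    """
    last_sep = max(amount.rfind(','), amount.rfind('.'))
    if last_sep == -1:
        return float(amount)  # plain digits
    digits_after = len(amount) - last_sep - 1
    kinds = set(re.findall(r'[.,]', amount))
    if len(kinds) == 1 and digits_after == 3:
        # e.g. '2,500' or '1,200,000': drop the separators entirely
        return float(re.sub(r'[.,]', '', amount))
    # Otherwise the last separator is the decimal mark.
    whole = re.sub(r'[.,]', '', amount[:last_sep])
    return float(f'{whole}.{amount[last_sep + 1:]}')

for raw in ['500', '2,500', '5.99', '3,99', '1,200.84']:
    print(raw, '->', amount_to_float(raw))
```

Capturing the currency word or symbol and choosing the decimal separator per currency, as suggested above, would remove the ambiguity this heuristic has to guess around.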

Python regex to parse financial data

I am relatively new to regex (always struggled with it for some reason)...
I have text that is of this form:
David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...
Mark Brookes, Non Executive Director bought 811 shares in the company on YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...
Albert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by...
Parsing the text reveals the following structure:
1. Two or more words beginning the sentence, before the first comma, are the name of the person involved in the transaction
2. One or more words before ('sold'|'bought'|'exercised'|'sold post-exercise') are the title of the person
3. The presence of one of these: ('sold'|'bought'|'exercised'|'sold post-exercise') AFTER the title identifies the transaction type
4. The first numeric string following the transaction type ('sold'|'bought'|'exercised'|'sold post-exercise') denotes the size of the transaction
5. 'price of ' precedes a numeric string, which specifies the price at which the deal was struck.
My question is:
How can I use this knowledge (and regex), to write a function that parses similar text to return the variables of interest (listed 1 - 5 above)?
Pseudo code for the function I want to write ..
def grok_directors_dealings_text(text_input):
    name, title, transaction_type, lot_size, price = (None, None, None, None, None)
    ....
    name = ...
    title = ...
    transaction_type = ...
    lot_size = ...
    price = ...
    pass
How would I use regex to implement the functions to return the variables of interest when passed in text that conforms to the structure I have identified above?
[[Edit]]
For some reason, I have seemed to struggle with regex for a while, if I am to learn from the correct answer here on S.O, it will be much better, if an explanation is offered as to why the magical expression (sorry, regexpr) actually works.
I want to actually learn this stuff instead of copy pasting expressions ...
You can use the following regex:
(.*?),\s(.*)\s(sold(?: post-exercise)?|bought|exercised)\s*([\d,]*).*price of\s*(\d*.\d+?p)
DEMO
Python:
import re
financialData = """
David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...
Mark Brookes, Non Executive Director bought 811 shares in the company on YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...
Albert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by...
"""
print(re.findall(r'(.*?),\s(.*)\s(sold(?: post-exercise)?|bought|exercised)\s*([\d,]*).*price of\s*(\d*.\d+?p)', financialData))
Output:
[('David Meredith', 'Financial Director', 'sold post-exercise', '15,000', '1044.00p'), ('Mark Brookes', 'Non Executive Director', 'bought', '811', '76.75p'), ('Albert Ellis', 'CEO', 'bought', '262', '52.00p')]
EDIT 1
To understand what the parts mean, follow the DEMO link; at the top right you can find a block explaining what each and every character means.
Also Debuggex helps you simulate the string by showing what group matches which characters!
Here's a debuggex demo for your particular case:
(.*?),\s(.*)\s(sold(?: post-exercise)?|bought|exercised)\s*([\d,]*).*price of\s*(\d*.\d+?p)
Debuggex Demo
I came up with this regex:
([\w ]+), ([\w ]+) (sold post-exercise|sold|bought|exercised) ([\d,\.]+).*price of ([\d\.,]+)p
Debuggex Demo
Basically, we are using parentheses to capture the important info you want, so let's check each one:
([\w ]+): \w matches any word character [a-zA-Z0-9_], and together with the space the class is repeated one or more times; this gives us the name of the person;
([\w ]+): another one of these after a comma and a space to get the title;
(sold post-exercise|sold|bought|exercised): then we search for our transaction types. Notice I put sold post-exercise before sold so that it tries to match the longer alternative first;
([\d,\.]+): then we try to find the numbers, which are made of digits (\d), a comma, and probably a dot may appear as well;
([\d\.,]+): then we need to get the price, which is basically the same as the size of the transaction.
The pieces of regex that connect the groups are pretty basic as well.
If you try it on regex101 it provides some explanation about the regex and generates this code in python to use:
import re
p = re.compile(r'([\w ]+), ([\w ]+) (sold post-exercise|sold|bought|exercised) ([\d,\.]+).*price of ([\d\.,]+)p')
test_str = u"David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...\n\nMark Brookes, Non Executive Director bought 811 shares in the company on YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...\n\nAlbert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by..."
re.findall(p, test_str)
You can use the following regex that just looks for characters surrounding the delimiters:
(.*?), (.*?) (sold post-exercise|bought|exercised|sold) (.*?) shares .*? price of (.*?)p
The parts in parentheses will be captured as groups.
>>> import re
>>> l = ['''David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...''', '''Mark Brookes, Non Executive Director bought 811 shares in the company on YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...''', '''Albert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by...''']
>>> for s in l:
...     print(re.findall(r'(.*?), (.*?) (sold post-exercise|bought|exercised|sold) (.*?) shares .*? price of (.*?)p', s))
...
[('David Meredith', 'Financial Director', 'sold post-exercise', '15,000', '1044.00')]
[('Mark Brookes', 'Non Executive Director', 'bought', '811', '76.75')]
[('Albert Ellis', 'CEO', 'bought', '262', '52.00')]
this is the regex that will do it
(.*?),(.*?)(sold post-exercise|sold|bought|exercised).*?([\d|,]+).*?price of ([\d|\.]+)
you use it like this
import re
def get_data(line):
    pattern = r"(.*?),(.*?)(sold post-exercise|sold|bought|exercised).*?([\d|,]+).*?price of ([\d|\.]+)"
    m = re.match(pattern, line)
    return m.groups()
for the first line this will return
('David Meredith', ' Financial Director ', 'sold post-exercise', '15,000', '1044.00')
EDIT:
adding explanation
This regex works as follows:
The first part, (.*?) followed by a comma, means: take the string up to the next thing to match (which is the ,).
. means any character.
* means it can repeat many times (many characters, not just one).
? means don't be greedy, so the match stops at the first ',' and not a later one (if there are several).
After that comes (.*?) again: take the characters up to the next thing to match (which is one of the constant words).
After that comes (sold post-exercise|sold|bought|exercised), which means: find one of these words (separated by |).
After that comes .*?, which again means take all text up to the next match (this time it is not surrounded by (), so it won't be captured as a group and won't be part of the output).
([\d|,]+) means take a digit (\d) or a comma; the + stands for one or more times. Inside a character class the | is a literal pipe character, so [\d,] is enough.
Then .*? again, like before.
'price of ' matches the literal string 'price of '.
And last, ([\d|\.]+) means take a digit or a dot (escaped because . otherwise means 'any character') one or more times.
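Putting the pieces of the walkthrough together, the function sketched in the question could look roughly like this. It is a sketch, not a definitive implementation: the named groups, the literal ' shares ' anchor, the escaped decimal point, and the conversion of lot size and price to numbers are choices made here, and the price is kept in pence as it appears in the text.

```python
import re

# Assumed sentence layout:
# NAME, TITLE <transaction> N shares ... price of P.PPp
DEAL = re.compile(
    r'(?P<name>.*?), (?P<title>.*?) '
    r'(?P<transaction_type>sold post-exercise|sold|bought|exercised) '
    r'(?P<lot_size>[\d,]+) shares .*?price of (?P<price>\d+(?:\.\d+)?)p'
)

def grok_directors_dealings_text(text_input):
    """Return (name, title, transaction_type, lot_size, price) or None."""
    m = DEAL.search(text_input)
    if m is None:
        return None  # text does not follow the assumed layout
    lot_size = int(m.group('lot_size').replace(',', ''))
    price = float(m.group('price'))  # pence
    return (m.group('name'), m.group('title'),
            m.group('transaction_type'), lot_size, price)

line = ('David Meredith, Financial Director sold post-exercise 15,000 shares '
        'in the company on YYYY-mm-dd at a price of 1044.00p. The Director '
        'now holds 6,290 shares representing 0.01% of the...')
print(grok_directors_dealings_text(line))
```

Returning None for text that does not match avoids the AttributeError that calling .groups() on a failed match would raise.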
