Using RegEx in Python to extract contents


Good evening,
I am very new to Python and RegEx. I have the following sentence:
-75.76 Card INSURANCEGrabPay ASIA DIRECT to Paid AM 1:16 +100.00 3257 UpAmex Top PM 9:55 +300.00 3257 UpAmex Top PM 9:55 -400.00 Card LTDGrabPay PTE AXS to Paid PM 9:57 (SGD) Amount Details Time here. appear will transactions cashless your All 2022 Feb 15 on made transactions GrabPay points 52 earned points Rewards 475.76 SGD spent Amount 0.24 SGD balance Wallet 2022 Feb 15 Summary statement daily your here
I would like to search for just '-' and the amount after that.
After that, I would like to skip 2 words and extract ALL the words just before 'Paid', in a single group if need be (I will read more about groups, but for now I need a single group, which I can later split to get the words from that string).
For instance, I would get
-75.76 ASIA Direct to
-400 PTE AXS to
What would be the regex command? Also, is there a good regex tutorial where I can read up on?

For now I have created one match having 2 groups ie, group1 for the amount and group2 for all the words (that include "to " string also).
Regex:
(-\d+\.?\d+) \w+ \w+ ([\w ]+)?Paid
You can check the details here: https://regex101.com/r/eUMgdW/1
Python code:
import re

# Raw string so the backslashes reach the regex engine unchanged.
pattern = r"(-\d+\.?\d+) \w+ \w+ ([\w ]+)?Paid"
output = re.findall(pattern, your_input_string)
for found in output:
    print(found)
# ('-75.76', 'ASIA DIRECT to ')
# ('-400.00', 'PTE AXS to ')
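If you want the words already split, here is a minimal sketch of the same idea with named groups. The names extract_debits, amount and words are mine, not from the question, and the sample text is a shortened copy of the sentence above. The lazy [\w ]+? followed by a literal " Paid" keeps the trailing space out of the captured group.

```python
import re

# Shortened copy of the sentence from the question.
text = ("-75.76 Card INSURANCEGrabPay ASIA DIRECT to Paid AM 1:16 "
        "+100.00 3257 UpAmex Top PM 9:55 "
        "-400.00 Card LTDGrabPay PTE AXS to Paid PM 9:57")

# amount: a minus sign and a number; then two skipped words;
# words: everything up to (but not including) " Paid".
pattern = re.compile(
    r"(?P<amount>-\d+(?:\.\d+)?) \w+ \w+ (?P<words>[\w ]+?) Paid"
)

def extract_debits(s):
    """Return (amount, [words]) pairs for every negative amount followed by 'Paid'."""
    return [(m.group("amount"), m.group("words").split())
            for m in pattern.finditer(s)]

print(extract_debits(text))
# [('-75.76', ['ASIA', 'DIRECT', 'to']), ('-400.00', ['PTE', 'AXS', 'to'])]
```

The positive amounts (+100.00, +300.00) are skipped because the pattern requires a leading minus sign.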

Rather than give you the actual regex, I'll gently nudge you in the right direction. It's more satisfying that way.
"Words" here are separated by spaces. So what you're searching for is a group of characters (captured), a space, characters again, a space, characters, a space, then capture everything up to the final "Paid". Try to create a regex to do that.
If you'd like to brush up on regex, check out Regex101. It's a web tool to test out regex, along with a debugger and a cheat sheet.

Related

Extract date from a string with a lot of numbers

There seem to be quite a few ways to extract datetimes in various formats from a string. But there seems to be an issue when the string contains many numbers and symbols.
Here is an example:
t = 'Annual Transmission Revenue Requirements and Rates Transmission Owner (Transmission Zone) Annual Transmission Revenue Requirement Network Integration Transmission Service Rate ($/MW-Year) AE (AECO) $136,632,319 $53,775 AEP (AEP) $1,295,660,732 $59,818.14 AP (APS) $128,000,000 $17,895 ATSI (ATSI) $659,094,666 $54,689.39 BC (BGE) $230,595,535 $35,762 ComEd, Rochelle (CE) $702,431,433 $34,515.60 Dayton (DAY) $40,100,000 $13,295.76 Duke (DEOK) $121,250,903 $24,077 Duquesne (DLCO) $139,341,808 $51,954.44 Dominion (DOM) $1,031,382,000 $52,457.21 DPL, ODEC (DPL) $163,224,128 $42,812 East Kentucky Power Cooperative (EKPC) $83,267,903 $24,441 MAIT (METED, PENELEC) $150,858,703 $26,069.39 JCPL $135,000,000 $23,597.27 PE (PECO) $155,439,100 $19,093 PPL, AECoop, UGI (PPL) $435,349,329 $58,865 PEPCO, SMECO (PEPCO) $190,876,083 $31,304.21 PS (PSEG) $1,248,819,352 $130,535.22 Rockland (RECO) $17,724,263 $44,799 TrAILCo $226,652,117.80 n/a Effective June 1, 2018 '
import datefinder
m = datefinder.find_dates(t)
for match in m:
    print(match)
Is there a way to smoothly extract the date? I can resort to re for specific formats if no better way exists. Judging by its GitHub repository, datefinder seems to have been abandoned a year ago.

Although I don't know exactly how your dates are formatted, here's a regex solution that works with dates separated by '/'. It should handle months and days written as a single digit or with a leading zero.
If your dates are separated by hyphens instead, replace the 9th and 18th character of the regex with a hyphen instead of /. (If using the second print statement, replace the 12th and 31st character)
Edit: Added the second print statement with some better regex. That's probably the better way to go.
import re
mystring = r'joasidj9238nlsd93901/01/2021oijweo8939n'
print(re.findall(r'\d{1,2}\/\d{1,2}\/\d{2,4}', mystring)) # This would probably work in most cases
print(re.findall(r'[0-1]{0,2}\/[0-3]{0,1}\d{0,1}\/\d{2,4}', mystring)) # This one is probably a better solution. (More protection against weirdness.)
# Note: [0-1]{0,2} only accepts the digits 0 and 1 for the month, so months such as 09 are not captured in full.
Edit #2: Here's a way to do it with the month name spelled out (in full, or 3-character abbreviation), followed by day, followed by comma, followed by a 2 or 4 digit year.
import re
mystring = r'Jan 1, 2020'
# October was missing from the alternation; it is included here.
print(re.findall(r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s+\d{1,2}\,\s+\d{2,4}',mystring))
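To tie this back to the question's string full of dollar amounts, here is a rough sketch that finds "Month D, YYYY" dates and turns them into datetime.date objects. The names find_dates, DATE_RE and MONTH_NUM are mine, the sample is just the tail of the question's text, and I am assuming 4-digit years and English month names only.

```python
import re
from datetime import date

# Month names in full or as 3-letter abbreviations (English only, an assumption).
MONTHS = (r"Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
          r"Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?")
DATE_RE = re.compile(rf"\b({MONTHS})\s+(\d{{1,2}}),\s+(\d{{4}})\b")

# Map the first three letters of the month name to its number.
MONTH_NUM = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

def find_dates(text):
    """Return a list of datetime.date objects for every 'Month D, YYYY' in text."""
    return [date(int(year), MONTH_NUM[month[:3]], int(day))
            for month, day, year in DATE_RE.findall(text)]

sample = ("Rockland (RECO) $17,724,263 $44,799 TrAILCo $226,652,117.80 "
          "n/a Effective June 1, 2018 ")
print(find_dates(sample))
# [datetime.date(2018, 6, 1)]
```

Because the pattern requires a month name, none of the dollar figures can be mistaken for a date.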

Add a single space and comma between words that are connected using regex

I have a nested list_3 which looks like:
[['Company OverviewCompany: HowSector: SoftwareYear Founded: 2010One Sentence Pitch: Easily give and request low-quality feedback with your team to achieve more togetherUniversity Affiliation(s): Duke$ Raised: $240,000Investors: Friends & familyTraction to Date: 10% of monthly active users (MAU) are also active weekly'], [['Company OverviewCompany: GrubSector: SoftwareYear Founded: 2018One Sentence Pitch: Find food you likeUniversity Affiliation(s): Stanford$ Raised: $340,000Investors: Friends & familyTraction to Date: 40% of monthly active users (MAU) are also active weekly']]]
I would like to use regex to add a comma followed by a single space between each pair of joined words (e.g. HowSector:, SoftwareYear, 2010One). So far I have tried to write a re.sub call to do this, by selecting all the characters without whitespace and replacing them, but I have run into some issues:
for i, list in enumerate(list_3):
    list_3[i] = [re.sub('r\s\s+', ', ', word) for word in list]
    list_33.append(list_3[i])
print(list_33)
error:
return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object
I would like the output to be:
[['Company Overview, Company: How, Sector: Software, Year Founded: 2010, One Sentence Pitch: Easily give and request low-quality feedback with your team to achieve more together University, Affiliation(s): Duke, $ Raised: $240,000, Investors: Friends & family, Traction to Date: 10% of monthly active users (MAU) are also active weekly'],[...]]
Any ideas how I can use regex to do this?

The main problem is that your nested list has no constant levels. Sometimes it has 2 levels and sometimes it has 3 levels. This is why you are getting the above error. In the case the list has 3 levels, re.sub receives a list as the third argument instead of a string.
The second problem is that the regex you are using is not the correct regex. The most naive regex we can use here should (at the very least) be able to find a non-whitespace character followed by a capital letter.
In the below example code, I'm using re.compile (since the same regex will be used over and over again, we might as well pre-compile it and gain some performance boost) and I'm just printing the output. You'll need to figure out a way to get the output in the format you want.
import re

regex = re.compile(r'(\S)([A-Z])')
replacement = r'\1, \2'
for inner_list in nested_list:
    for string_or_list in inner_list:
        if isinstance(string_or_list, str):
            print(regex.sub(replacement, string_or_list))
        else:
            for string in string_or_list:
                print(regex.sub(replacement, string))
Outputs
Company Overview, Company: How, Sector: Software, Year Founded: 2010, One Sentence Pitch: Easily give and request low-quality feedback with your team to achieve more together, University Affiliation(s): Duke$ Raised: $240,000, Investors: Friends & family, Traction to Date: 10% of monthly active users (, MA, U) are also active weekly
Company Overview, Company: Grub, Sector: Software, Year Founded: 2018, One Sentence Pitch: Find food you like, University Affiliation(s): Stanford$ Raised: $340,000, Investors: Friends & family, Traction to Date: 40% of monthly active users (, MA, U) are also active weekly

I believe you can use the following Python code.
rgx = r'(?<=[a-z\d])([A-Z$][A-Za-z]*(?: +\S+?)*)*:'
rep = r', \1:'
re.sub(rgx, rep, s)
where s is the string.
Python's regex engine performs the following operations when matching.
(?<=            : begin positive lookbehind
  [a-z\d]       : match a lowercase letter or digit
)               : end positive lookbehind
(               : begin capture group 1
  [A-Z$]        : match a capital letter or '$'
  [A-Za-z]*     : match 0+ letters
  (?: +\S+?)    : match 1+ spaces greedily, then 1+ non-spaces non-greedily, in a non-capture group
  *             : execute non-capture group 0+ times
)*              : end capture group 1, repeated 0+ times
:               : match ':'
Note that the positive lookbehind and permissible characters for each token in the capture group may need to be adjusted to suit requirements.
The regular expression employed to construct replacement strings (, \1:) creates the string ', ' followed by the contents of capture group 1 followed by a colon.

If your list of lists is arbitrarily deep, you can traverse it recursively, process the strings with a regex, and yield the same structure:
import re
from collections.abc import Iterable

def process(l):
    for el in l:
        if isinstance(el, Iterable) and not isinstance(el, (str, bytes)):
            yield type(el)(process(el))
        else:
            yield ', '.join(re.split(r'(?<=[a-z])(?=[A-Z])', el))
Given your example as LoL here is the result:
>>> list(process(LoL))
[['Company Overview, Company: How, Sector: Software, Year Founded: 2010One Sentence Pitch: Easily give and request low-quality feedback with your team to achieve more together, University Affiliation(s): Duke$ Raised: $240,000Investors: Friends & family, Traction to Date: 10% of monthly active users (MAU) are also active weekly'], [['Company Overview, Company: Grub, Sector: Software, Year Founded: 2018One Sentence Pitch: Find food you like, University Affiliation(s): Stanford$ Raised: $340,000Investors: Friends & family, Traction to Date: 40% of monthly active users (MAU) are also active weekly']]]
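Note that a lookbehind of [a-z] leaves "2010One" and "$240,000Investors" joined, since a digit precedes the capital letter there. A possible variation, sketched below, adds digits to the lookbehind. The names BOUNDARY and add_separators and the shortened sample data are mine, and I am assuming the structure contains only lists and strings.

```python
import re

# Zero-width boundary: a lowercase letter or digit on the left, a capital on the right.
BOUNDARY = re.compile(r'(?<=[a-z0-9])(?=[A-Z])')

def add_separators(item):
    """Insert ', ' at every boundary, recursing through nested lists."""
    if isinstance(item, str):
        return BOUNDARY.sub(', ', item)
    return [add_separators(child) for child in item]

# Shortened, hypothetical sample with the same mixed nesting as the question.
nested = [
    ['Company OverviewCompany: HowSector: SoftwareYear Founded: 2010One Sentence Pitch'],
    [['$ Raised: $240,000Investors: Friends & family (MAU)']],
]
print(add_separators(nested))
```

Because the match is zero-width, nothing is consumed, and "(MAU)" is left alone since "(" is neither a lowercase letter nor a digit. "Duke$ Raised" would still stay joined; handling "$" needs an extra rule.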

How to extract string that contains specific characters in Python

I'm trying to extract ONLY the one string that contains the $ character. The input is text that I extracted using BeautifulSoup.
Code
price = [m.split() for m in re.findall(r"\w+/$(?:\s+\w+/$)*", soup_content.find('blockquote', { "class": "postcontent restore" }).text)]
Input
For Sale is my Tag Heuer Carrera Calibre 6 with box and papers and extras.
39mm
47 ish lug to lug
19mm in between lugs
Pretty thin but not sure exact height. Likely around 12mm (maybe less)
I've owned it for about 2 years. I absolutely love the case on this watch. It fits my wrist and sits better than any other watch I've ever owned. I'm selling because I need cash and other pieces have more sentimental value
I am the second owner, but the first barely wore it.
It comes with barely worn blue leather strap, extra suede strap that matches just about perfectly and I'll include a blue Barton Band Elite Silicone.
I also purchased an OEM bracelet that I personally think takes the watch to a new level. This model never came with a bracelet and it was several hundred $ to purchase after the fact.
The watch was worn in rotation and never dropped or knocked around.
The watch does have hairlines, but they nearly all superficial. A bit of time with a cape cod cloth would take care of a lot it them. The pics show the imperfections in at "worst" possible angle to show the nature of scratches.
The bracelet has a few desk diving marks, but all in all, the watch and bracelet are in very good shape.
Asking $2000 obo. PayPal shipped. CONUS.
It's a big hard to compare with others for sale as this one includes the bracelet.
The output should be like this.
2000

You don't need a regex. Instead you can iterate over lines and over each word to check for starting with '$' and extract the word:
[word[1:] for line in s.split('\n') for word in line.split() if word.startswith('$') and len(word) > 1]
where s is your paragraph.
which outputs:
['2000']

Since this is very simple you don't need a regex solution; this should suffice:
words = text.split()
words_with_dollar = [word for word in words if '$' in word]
print(words_with_dollar)
>>> ['$', '$2000']
If you don't want the dollar sign alone, you can add a filter like this:
words_with_dollar = [word for word in words if '$' in word and '$' != word]
print(words_with_dollar)
>>> ['$2000']

I would do something like this (provided input is the string you wrote above):
price_start = input.find('$')
price = input[price_start:].split(' ')[0]
This works only if there is a single occurrence of '$', as you said.
Alternatively, you could use a regex like this:
price = re.findall(r'\S*\$\S*\d', input)[0]
price = price.replace('$', '')
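If you do go the regex route, one option is to require a digit right after the dollar sign, so the lone "$" in "several hundred $ to purchase" is ignored. This is only a sketch: find_prices and PRICE_RE are names I made up, and the sample is two lines taken from the post above.

```python
import re

# A '$' immediately followed by digits, with optional thousands commas and decimals.
# Only the numeric part is captured.
PRICE_RE = re.compile(r'\$(\d[\d,]*(?:\.\d+)?)')

def find_prices(text):
    """Return the numeric part of every '$<number>' in text, as strings."""
    return PRICE_RE.findall(text)

sample = ("This model never came with a bracelet and it was several hundred $ "
          "to purchase after the fact.\n"
          "Asking $2000 obo. PayPal shipped. CONUS.")
print(find_prices(sample))
# ['2000']
```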

Removing varying text phrases through RegEx in a Python Data frame

Basically, I want to remove the certain phrase patterns embedded in my text data:
Starts with an upper case letter and ends with an Em Dash "—"
Starts with an Em Dash "—" and ends with a "Read Next"
Say, I've got the following data:
CEBU CITY—The widow of slain human rights lawyer .... citing figures from the NUPL that showed that 34 lawyers had been killed in the past two years. —WITH REPORTS FROM JULIE M. AURELIO AND DJ YAPRead Next
and
Manila, Philippines—President .... but justice will eventually push its way through their walls of impunity, ... —REPORTS FROM MELVIN GASCON, JULIE M. AURELIO, DELFIN T. MALLARI JR., JEROME ANING, JOVIC YEE, GABRIEL PABICO LALU, PATHRICIA ANN V. ROXAS, DJ YAP, AFP, APRead Next
I want to remove the following phrases:
"CEBU CITY—"
"—WITH REPORTS FROM JULIE M. AURELIO AND DJ YAPRead Next"
"Manila, Philippines—"
"—REPORTS FROM MELVIN GASCON, JULIE M. AURELIO, DELFIN T. MALLARI JR., JEROME ANING, JOVIC YEE, GABRIEL PABICO LALU, PATHRICIA ANN V. ROXAS, DJ YAP, AFP, APRead Next"
I am assuming this would be needing two regex for each patterns enumerated above.
The regex: —[A-Z].*Read Next\s*$ may work on the pattern # 2 but only when there are no other em dashes in the text data. It will not work when pattern # 1 occurs as it will remove the chunk from the first em dash it has seen until the "Read Next" string.
I have tried the following regex for pattern # 1:
^[A-Z]([A-Za-z]).+(—)$
Why does it not work? That regex was supposed to match a phrase that starts with any upper case letter, followed by a string of any length, as long as it ends with an "—".

What you are treating as a hyphen is not actually a hyphen; it is an Em Dash. Hence you need this regex, which starts with an em dash instead of a hyphen:
^—[A-Z].*Read Next\s*$
Here is the explanation for this regex:
^ --> Start of input
— --> Matches a literal Em Dash whose Unicode Decimal Code is 8212
[A-Z] --> Matches an upper case letter
.* --> Matches any character zero or more times
Read Next --> Matches these literal words
\s* --> This is for matching any optional white space that might be present at the end of line
$ --> End of input

This regex should take care of it:
^—[A-Z]+(.)*(Read Next)$
You can try implementing this regex on your data and see if it works out.
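For handling both patterns at once, here is a rough sketch using two separate substitutions. The names strip_wrappers, DATELINE_RE and CREDITS_RE are mine. I am assuming the dateline is at most about 40 characters long (so an em dash deep inside the article is not mistaken for one) and that the reporter credits contain no em dash of their own.

```python
import re

# Pattern 1: a capitalised dateline at the very start, ending at the first em dash.
# The {0,40} cap is an assumption to avoid eating body text.
DATELINE_RE = re.compile(r'^[A-Z][^—]{0,40}—')

# Pattern 2: the last em dash and everything after it, up to a trailing "Read Next".
# [^—]* cannot cross another em dash, so earlier dashes in the body are left alone.
CREDITS_RE = re.compile(r'\s*—[^—]*Read Next\s*$')

def strip_wrappers(text):
    """Remove a leading dateline and a trailing '—... Read Next' credit line."""
    text = DATELINE_RE.sub('', text, count=1)
    return CREDITS_RE.sub('', text)

sample = ("CEBU CITY—The widow spoke to reporters. "
          "—WITH REPORTS FROM JULIE M. AURELIO AND DJ YAPRead Next")
print(strip_wrappers(sample))
# The widow spoke to reporters.
```

On a DataFrame column, the same two patterns could be applied row by row, for example with the column's apply method and strip_wrappers.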

How to parse exact data without including surrounding text?

My code is very close to succeeding but I just need a little help.
I have 100's of pages of data but I am working on parsing only 1 page perfectly before I apply it to the others. In this one page, which is an email, I need to extract several things: a Date, Sector, Fish Species, Pounds, and Money. So far I have been successful in using RegularExpressions to recognize certain words and extract the data from that line: such as looking for "Sent" because I know the Date information will always follow that, and looking for either "Pounds" or "lbs" because the Pounds information will always precede that.
The problem I am having is that my code is grabbing the entire line that the data is on, not just the numeric data. I want to grab just the number value for Pounds, for example, but I realize this will be extremely difficult because every one of the 100's of emails is worded differently. I'm not sure if it is even possible to make this code foolproof because I need RegEx to recognize the text that surrounds the data, but not include it in my export command. So will I simply be blindly grabbing at characters following certain recognized words?
Here is a piece of my code used for extracting the Pounds data:
for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        with open(file_path, 'r') as f:
            sector_result = []
            pattern = re.compile("Pounds | lbs", re.IGNORECASE)
            for linenum, line in enumerate(f):
                if pattern.search(line) != None:
                    sector_result.append((linenum, line.rstrip('\n')))
            for linenum, line in sector_result:
                print ("Pounds:", line)
And here is what it prints out:
Pounds: -GOM Cod up to 5,000 lbs (live wt) # 1.40 lbs
Pounds: -GOM Cod up to 5,000 lbs (live wt) # 1.40 lbs
Pounds: -American Plaice 2,000 lbs .60 lbs or best offer
Ideally I would just like the 5,000 lb numeric value to be exported but I am not sure how I would go about grabbing just that number.
Here is the original email text I need to parse:
From:
Sent: Friday, November 15, 2013 2:43pm
To:
Subject: NEFS 11 fish for lease
Greetings,
NEFS 11 has the following fish for lease:
-GOM Cod up to 5,000 lbs (live wt) # 1.40 lbs
-American Plaice 2,000 lbs .60 lbs or best offer
Here is another separate email though that will need to be parsed; this is why writing this code is difficult because it'll have to tackle a variety of differently worded emails, since all are written by different people:
From:
Sent: Monday, December 09, 2013 1:13pm
To:
Subject: NEFS 6 Stocks for lease October 28 2013
Hi All,
The following is available from NEFS VI:
4,000 lbs. GBE COD (live wt)
10,000 lbs. SNE Winter Flounder
10,000 lbs. SNE Yellowtail
10,000 lbs GB Winter Flounder
Will lease for cash or trade for GOM YT, GOM Cod, Dabs, Grey sole stocks on equitable basis.
Please forward all offers.
Thank you,
Any and all help is appreciated, as is criticism of how I asked the question. Thanks.

Here's a regex flexible enough:
import os
import re

# path is the directory holding the email files, as in the question.
for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        with open(file_path, 'r') as f:
            pattern = r'(\d[\d,.]+)\s*(?:lbs|[Pp]ounds)'
            content = f.read()
            ### if you want only the first match ###
            match = re.search(pattern, content)
            if match:
                print(match.group(1))
            ### if you want all the matches ###
            matches = re.findall(pattern, content)
            if matches:
                print(matches)
You could be more thorough with the regex if needed.
Hope this helps!
UPDATE
The main part here is the regular expression (\d[\d,.]+)\s*(?:lbs|[Pp]ounds). This is a basic one, explained as follows:
(
  \d             -> Start with any digit character
  [\d,.]+        -> Followed by one or more digits, commas or dots
)
\s*              -> Followed by zero or more whitespace characters
(?:
  lbs|[Pp]ounds  -> Followed by either 'lbs' or 'Pounds' or 'pounds'
)
The parentheses define the capturing group, so (\d[\d,.]+) is the part being captured: basically the numeric part.
The parentheses with a ?: define a non-capturing group.
This regex will match:
2,890 lbs (capturing '2,890')
3.6 pounds (capturing '3.6')
5678829 Pounds
23 lbs
9,894Pounds
etc
As well as unwanted stuff like:
2..... lbs
3,4,6,7,8 pounds
It will not match:
7,423
23m lbs
45 ppounds
2.8 Pound
You could make a much more complicated regex depending on the complexity of the contents you have. I would think this regex is good enough for your purposes.
Hope this helps clarify
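If the unwanted matches above become a problem, a somewhat stricter pattern is possible. The following is only a sketch: first_pounds_value and POUNDS_RE are names I made up, and I am assuming quantities are written either with thousands commas (5,000) or as plain numbers (23, 3.6, .60), followed by lb, lbs, pound or pounds in any letter case.

```python
import re

# Number forms: 5,000 / 10,000 (comma-grouped), 23 or 3.6 (plain), .60 (leading dot).
# Unit: lb, lbs, pound, pounds (case-insensitive), ending at a word boundary.
POUNDS_RE = re.compile(
    r'(\d{1,3}(?:,\d{3})+|\d+(?:\.\d+)?|\.\d+)\s*(?:lbs?|pounds?)\b',
    re.IGNORECASE,
)

def first_pounds_value(line):
    """Return the first quantity in pounds on the line as a float, or None."""
    match = POUNDS_RE.search(line)
    if match is None:
        return None
    return float(match.group(1).replace(',', ''))

print(first_pounds_value("-GOM Cod up to 5,000 lbs (live wt) # 1.40 lbs"))  # 5000.0
print(first_pounds_value("10,000 lbs. SNE Winter Flounder"))                # 10000.0
print(first_pounds_value("Please forward all offers."))                     # None
```

Taking the first match per line picks the quantity rather than the trailing per-pound figure in lines like the GOM Cod one, but that relies on the quantity coming first, which may not hold for every email.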

Regex can recognize the text around a value without exporting it; this is done with a non-capturing group. For example:
Pounds: -GOM Cod up to 5,000 lbs (live wt) # 1.40 lbs
To recognize, up to, the value you want, and (live wt) you could write a regex like this:
(?: up to).(\d+,\d+.lbs).(?:\(live wt\))
Essentially (?:) is a matching group that isn't captured, so the regex only captures the middle bracketed group.
If you provide the exact surrounding text you want I can be more specific.
Edit:
Going off your new examples I can see that the only similarity between all examples is that you have a number (in the thousands so it has a ,), followed by some amount of whitespace, followed by lbs. So your regex would look like:
(?:(\d+,\d+)\s+lbs)
This will return the matches of the numbers themselves. This regex will exclude the smaller values, by virtue of ignoring values that are not in the thousands (i.e. that do not contain a ,).
Edit 2:
I figured I'd also point out that this can be done entirely without regex using str.split(). Instead of trying to find a particular word pattern, you can just use the fact that the number you want will be the word before lbs, i.e. if lbs is at position i, then your number is at position i-1.
The only other consideration you have to face is how to deal with multiple values, the two obvious ones are:
Biggest value.
First value.
Here's how both cases would work with your original code:
import os
import re

def max_pounds(line):
    pound_values = {}
    words = line.split()
    for i, word in enumerate(words):
        if word.lower() == 'lbs':
            # Convert the number into a float
            # and save the original string representation.
            pound_values[(float(words[i-1].replace(',','')))] = words[i-1]
    # Print the biggest numerical value.
    print(pound_values[max(pound_values.keys())])

def first_pounds(line):
    words = line.split()
    for i, word in enumerate(words):
        if word.lower() == 'lbs':
            # Print the number and exit.
            print(words[i-1])
            return

for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        with open(file_path, 'r') as f:
            sector_result = []
            pattern = re.compile("Pounds | lbs", re.IGNORECASE)
            for linenum, line in enumerate(f):
                if pattern.search(line) != None:
                    sector_result.append((linenum, line.rstrip('\n')))
            for linenum, line in sector_result:
                print ("Pounds:", line)
                # Only one function is required.
                max_pounds(line)
                first_pounds(line)
The one caveat is that the code doesn't handle the edge case where lbs is the first word, but this is easily handled with a try/except.
Neither regex nor split will work if the value before lbs is something other than the number. If you run into that problem I would suggest searching your data for offending emails and, if their number is small enough, editing them by hand.
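As a rough illustration of guarding those edge cases, the sketch below returns the values instead of printing them. The function name pounds_before_unit is mine, and accepting "lb", "lbs." and "pounds" as units is an assumption beyond the original code, which only checks for 'lbs'.

```python
def pounds_before_unit(line):
    """Return every number that directly precedes a pounds unit, as floats."""
    words = line.split()
    values = []
    for i, word in enumerate(words):
        # i > 0 guards the case where the unit is the first word:
        # words[-1] would silently wrap around to the last word.
        if i > 0 and word.lower().rstrip('.') in ('lb', 'lbs', 'pounds'):
            try:
                values.append(float(words[i - 1].replace(',', '')))
            except ValueError:
                # The word before the unit was not a number; skip it.
                continue
    return values

line = "-GOM Cod up to 5,000 lbs (live wt) # 1.40 lbs"
values = pounds_before_unit(line)
print(values)       # [5000.0, 1.4]
print(values[0])    # first value: 5000.0
print(max(values))  # biggest value: 5000.0
```

Returning a list lets the caller choose between the first and the biggest value, and an empty list signals that nothing usable was found.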
