How to parse exact data without including surrounding text?

My code is very close to succeeding but I just need a little help.
I have hundreds of pages of data, but I am working on parsing only one page perfectly before I apply it to the others. In this one page, which is an email, I need to extract several things: a Date, Sector, Fish Species, Pounds, and Money. So far I have been successful in using regular expressions to recognize certain words and extract the data from that line, such as looking for "Sent" because I know the Date information will always follow it, and looking for either "Pounds" or "lbs" because the Pounds information will always precede it.
The problem I am having is that my code is grabbing the entire line that the data is on, not just the numeric data. I want to grab just the number value for Pounds, for example, but I realize this will be extremely difficult because every one of the hundreds of emails is worded differently. I'm not sure if it is even possible to make this code foolproof, because I need the regex to recognize the text that surrounds the data but not include it in my export command. So will I simply be blindly grabbing at characters following certain recognized words?
Here is a piece of my code used for extracting the Pounds data:
for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        with open(file_path, 'r') as f:
            sector_result = []
            pattern = re.compile("Pounds | lbs", re.IGNORECASE)
            for linenum, line in enumerate(f):
                if pattern.search(line) != None:
                    sector_result.append((linenum, line.rstrip('\n')))
                    for linenum, line in sector_result:
                        print ("Pounds:", line)
And here is what it prints out:
Pounds: -GOM Cod up to 5,000 lbs (live wt) # 1.40 lbs
Pounds: -GOM Cod up to 5,000 lbs (live wt) # 1.40 lbs
Pounds: -American Plaice 2,000 lbs .60 lbs or best offer
Ideally I would just like the 5,000 lb numeric value to be exported but I am not sure how I would go about grabbing just that number.
Here is the original email text I need to parse:
From:
Sent: Friday, November 15, 2013 2:43pm
To:
Subject: NEFS 11 fish for lease
Greetings,
NEFS 11 has the following fish for lease:
-GOM Cod up to 5,000 lbs (live wt) # 1.40 lbs
-American Plaice 2,000 lbs .60 lbs or best offer
Here is another separate email though that will need to be parsed; this is why writing this code is difficult because it'll have to tackle a variety of differently worded emails, since all are written by different people:
From:
Sent: Monday, December 09, 2013 1:13pm
To:
Subject: NEFS 6 Stocks for lease October 28 2013
Hi All,
The following is available from NEFS VI:
4,000 lbs. GBE COD (live wt)
10,000 lbs. SNE Winter Flounder
10,000 lbs. SNE Yellowtail
10,000 lbs GB Winter Flounder
Will lease for cash or trade for GOM YT, GOM Cod, Dabs, Grey sole stocks on equitable basis.
Please forward all offers.
Thank you,
Any and all help is appreciated, as well as question asking criticism. Thanks.

Here's a regex flexible enough:
for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        with open(file_path, 'r') as f:
            pattern = r'(\d[\d,.]+)\s*(?:lbs|[Pp]ounds)'
            content = f.read()
            ### if you want only the first match ###
            match = re.search(pattern, content)
            if match:
                print(match.group(1))
            ### if you want all the matches ###
            matches = re.findall(pattern, content)
            if matches:
                print(matches)
You could be more thorough with the regex if needed.
Hope this helps!
UPDATE
The main part here is the regular expression (\d[\d,.]+)\s*(?:lbs|[Pp]ounds). This is a basic one, explained as follows:
(
    \d              -> Start with any digit character
    [\d,.]+         -> Followed by one or more digits, commas or dots
)
\s*                 -> Followed by zero or more whitespace characters
(?:
    lbs|[Pp]ounds   -> Followed by either 'lbs' or 'Pounds' or 'pounds'
)
The parentheses define the capturing group, so (\d[\d,.]+) is the part being captured, which is basically the numeric part.
The parentheses starting with ?: define a non-capturing group.
This regex will match:
2,890 lbs (capturing '2,890')
3.6 pounds (capturing '3.6')
5678829 Pounds
23 lbs
9,894Pounds
etc
As well as unwanted stuff like:
2..... lbs
3,4,6,7,8 pounds
It will not match:
7,423
23m lbs
45 ppounds
2.8 Pound
You could make a much more complicated regex depending on the complexity of the contents you have. I would think this regex is good enough for your purposes.
Hope this helps clarify
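To double-check the lists above, here is a small sketch that runs the same pattern over a few of the sample strings. The helper name captured is only for illustration and is not part of the original answer. One extra case I added myself is "5 lbs": because the pattern needs a digit followed by at least one more digit, comma or dot, a single-digit number is not matched.

```python
import re

# Same pattern as in the answer above.
PATTERN = re.compile(r'(\d[\d,.]+)\s*(?:lbs|[Pp]ounds)')

def captured(text):
    """Return the captured numeric part, or None when the pattern does not match."""
    match = PATTERN.search(text)
    return match.group(1) if match else None

samples = [
    "2,890 lbs", "3.6 pounds", "9,894Pounds",   # wanted matches
    "2..... lbs",                                # unwanted match
    "7,423", "23m lbs", "45 ppounds", "2.8 Pound",  # no match
    "5 lbs",                                     # no match: single digit (my addition)
]
for sample in samples:
    print(repr(sample), "->", captured(sample))
```

If single-digit values matter in your emails, changing [\d,.]+ to [\d,.]* should let them through, at the cost of matching a few more stray digits.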

Regex can recognize the text around a value without exporting it; this is done with a non-capturing group. For example:
Pounds: -GOM Cod up to 5,000 lbs (live wt) # 1.40 lbs
To recognize "up to", the value you want, and "(live wt)", you could write a regex like this:
(?: up to).(\d+,\d+.lbs).(?:\(live wt\))
Essentially (?:) is a matching group that isn't captured, so the regex only captures the middle bracketed group.
If you provide the exact surrounding text you want I can be more specific.
Edit:
Going off your new examples I can see that the only similarity between all examples is that you have a number (in the thousands so it has a ,), followed by some amount of whitespace, followed by lbs. So your regex would look like:
(?:(\d+,\d+)\s+lbs)
This will return the matches of the numbers themselves. You can see an example of it working here. This regex will exclude the smaller values, because it ignores values that are not in the thousands (i.e. that do not contain a ,).
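As a rough sketch of how that regex behaves on the two emails from the question (the outer non-capturing group is dropped here because it does not change what is captured; the names THOUSANDS_LBS and pounds are made up for this example):

```python
import re

# A number containing a comma, then whitespace, then "lbs".
THOUSANDS_LBS = re.compile(r'(\d+,\d+)\s+lbs')

email_one = """NEFS 11 has the following fish for lease:
-GOM Cod up to 5,000 lbs (live wt) # 1.40 lbs
-American Plaice 2,000 lbs .60 lbs or best offer"""

email_two = """The following is available from NEFS VI:
4,000 lbs. GBE COD (live wt)
10,000 lbs. SNE Winter Flounder
10,000 lbs. SNE Yellowtail
10,000 lbs GB Winter Flounder"""

def pounds(text):
    """Return every comma-grouped pound value in text as an int."""
    return [int(value.replace(',', '')) for value in THOUSANDS_LBS.findall(text)]

print(pounds(email_one))  # [5000, 2000]
print(pounds(email_two))  # [4000, 10000, 10000, 10000]
```

The prices (1.40 and .60) are skipped only because they contain no comma, so a quantity under 1,000 lbs would be skipped as well.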
Edit 2:
Also, I figured I'd point out that this can be done entirely without regex using str.split(). Instead of trying to find a particular word pattern, you can just use the fact that the number you want will be the word before lbs, i.e. if lbs is at position i, then your number is at position i-1.
The only other consideration you have to face is how to deal with multiple values, the two obvious ones are:
Biggest value.
First value.
Here's how both cases would work with your original code:
def max_pounds(line):
    pound_values = {}
    words = line.split()
    for i, word in enumerate(words):
        if word.lower() == 'lbs':
            # Convert the number into a float
            # and save the original string representation.
            pound_values[float(words[i-1].replace(',', ''))] = words[i-1]
    # Print the biggest numerical value (if any value was found).
    if pound_values:
        print(pound_values[max(pound_values.keys())])
def first_pounds(line):
    words = line.split()
    for i, word in enumerate(words):
        if word.lower() == 'lbs':
            # Print the number and exit.
            print(words[i-1])
            return
for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        with open(file_path, 'r') as f:
            sector_result = []
            pattern = re.compile("Pounds | lbs", re.IGNORECASE)
            for linenum, line in enumerate(f):
                if pattern.search(line) != None:
                    sector_result.append((linenum, line.rstrip('\n')))
            for linenum, line in sector_result:
                print ("Pounds:", line)
                # Only one function is required.
                max_pounds(line)
                first_pounds(line)
The one caveat is that the code doesn't handle the edge case where lbs is the first word. In that case words[i-1] is words[-1], which silently picks up the last word of the line, so check that i > 0; a try-except around float() also covers values that are not numbers.
Neither regex nor split will work if the value before lbs is something other than the number. If you run into that problem I would suggest searching your data for offending emails and, if their number is small enough, editing them by hand.
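Putting those caveats together, a variant of the split approach could look roughly like this. It is only a sketch: the function name pounds_in_line is invented here, and treating "lbs." and "pounds" as unit words is my assumption based on the second email in the question.

```python
def pounds_in_line(line):
    """Return the numbers that directly precede a pounds unit word, as floats."""
    words = line.split()
    values = []
    for i, word in enumerate(words):
        # Strip trailing punctuation so "lbs." and "lbs," also count (assumption).
        unit = word.lower().rstrip('.,')
        if i > 0 and unit in ('lbs', 'pounds'):
            try:
                values.append(float(words[i - 1].replace(',', '')))
            except ValueError:
                # The word before the unit was not a number; skip it.
                continue
    return values

print(pounds_in_line("-GOM Cod up to 5,000 lbs (live wt) # 1.40 lbs"))  # [5000.0, 1.4]
print(pounds_in_line("4,000 lbs. GBE COD (live wt)"))                   # [4000.0]
```

Because it returns a list instead of printing, the caller can pick max(values) or values[0] for the "biggest" and "first" cases.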

Related

Using RegEx in Python to extract contents

Good evening,
I am very new to Python and RegEx. I have the following sentence:
-75.76 Card INSURANCEGrabPay ASIA DIRECT to Paid AM 1:16 +100.00 3257 UpAmex Top PM 9:55 +300.00 3257 UpAmex Top PM 9:55 -400.00 Card LTDGrabPay PTE AXS to Paid PM 9:57 (SGD) Amount Details Time here. appear will transactions cashless your All 2022 Feb 15 on made transactions GrabPay points 52 earned points Rewards 475.76 SGD spent Amount 0.24 SGD balance Wallet 2022 Feb 15 Summary statement daily your here
I would like to search for just '-' and the amount after that.
After that, I would like to skip 2 words and extract ALL the words just before 'Paid', if need be in a single group (I will read more about groups, but for now I need them in a single group, which I can later split to get the words from that string).
For instance, I would get
-75.76 ASIA Direct to
-400 PTE AXS to
What would be the regex command? Also, is there a good regex tutorial where I can read up on?

For now I have created one match with 2 groups, i.e. group 1 for the amount and group 2 for all the words (which also includes the "to " string).
Regex:
(-\d+\.?\d+) \w+ \w+ ([\w ]+)?Paid
You can check the details here: https://regex101.com/r/eUMgdW/1
Python code:
import re
# Raw string, so the backslashes reach the regex engine unchanged.
output = re.findall(r"(-\d+\.?\d+) \w+ \w+ ([\w ]+)?Paid", your_input_string)
for found in output:
    print(found)
# ('-75.76', 'ASIA DIRECT to ')
# ('-400.00', 'PTE AXS to ')
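Since the question mentions splitting the second group into words later, here is a hedged sketch of that follow-up step. It uses a shortened copy of the sentence from the question, and the names TRANSACTION, sample and rows are only illustrative.

```python
import re

# Same regex as above, compiled once.
TRANSACTION = re.compile(r'(-\d+\.?\d+) \w+ \w+ ([\w ]+)?Paid')

# Shortened version of the sentence in the question.
sample = ("-75.76 Card INSURANCEGrabPay ASIA DIRECT to Paid AM 1:16 "
          "+100.00 3257 UpAmex Top PM 9:55 "
          "-400.00 Card LTDGrabPay PTE AXS to Paid PM 9:57")

rows = []
for match in TRANSACTION.finditer(sample):
    amount = float(match.group(1))
    # Group 2 is optional, so fall back to an empty string before splitting.
    words = (match.group(2) or '').split()
    rows.append((amount, words))

print(rows)
# [(-75.76, ['ASIA', 'DIRECT', 'to']), (-400.0, ['PTE', 'AXS', 'to'])]
```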

Rather than give you the actual regex, I'll gently nudge you in the right direction. It's more satisfying that way.
"Words" here are separated by spaces. So what you're searching for is a group of characters (captured), a space, characters again, space, characters, space, then capture everything and end with "Paid". Try to create a regex to do that.
If you'd like to brush up on regex, check out Regex101. It's a web tool to test out regex, along with a debugger and a cheat sheet.

Check if there are numbers around a keyword in a text file

I have a text file 'Filter.txt' which contains a specific keyword 'D&O insurance'. I would like to check if there are numbers in the sentence which contains that keyword, as well as in the 2 sentences before and after it.
For example, I have a long paragraph like this:
"International insurance programs necessary for companies with global subsidiaries and offices. Coverage is usually for current, future and past directors and officers of a company and its subsidiaries. D&O insurance grants cover on a claims-made basis. How much is enough? What and who is covered – and not covered? "
The target word is "D&O insurance." If I wanted to extract the target sentence (D&O insurance grants cover on a claims-made basis.) as well as the preceding and following sentences (Coverage is usually for current, future and past directors and officers of a company and its subsidiaries. and How much is enough?), what would be a good approach?
This is what I'm trying to do so far. However I don't really know how to apply to find ways to check in the whole sentence and the ones around it.
for line in open('Filter.txt'):
    match = re.search(r'D&O insurance(\d+)', line)
    if match:
        print(match.group(1))
I'm new to programming, so I'm looking for the possible solutions for that purpose.
Thank you for your help!

Okay, I'm going to take a stab at this. Assume string is the entire contents of your .txt file (you may need to clean the '\n's out).
You're going to want to make a list of potential sentence endings, use that list to find the index positions of the sentence endings, and then use THAT list to make a list of the sentences in the file.
string = "International insurance programs necessary for companies with global subsidiaries and offices. Coverage is usually for current, future and past directors and officers of a company and its subsidiaries. D&O insurance grants cover on a claims-made basis. How much is enough? What and who is covered – and not covered?"
endings = ['! ', '? ','. ']
def pos_find(string):
    lst = []
    for ending in endings:
        i = string.find(ending)
        if i != -1:
            lst.append(i)
    # min() raises ValueError when no ending was found.
    return min(lst)
def sort_sentences(string):
    sentences = []
    while True:
        try:
            i = pos_find(string)
            sentences.append(string[0:i+1])
            string = string[i+2:]
        except ValueError:
            # No more endings: the rest is the last sentence.
            sentences.append(string)
            break
    return sentences
sentences = sort_sentences(string)
Once you have the list of sentences (I got a little weary here, so forgive the spaghetti code - the functionality is there), you will need to comb through that list to find characters that could be integers (this is how I'm checking for numbers...but you COULD do it differently).
for i in range(len(sentences)):
    sentence = sentences[i]
    match = sentence.find('D&O insurance')
    print(match)
    if match >= 0:
        # Previous sentence, target sentence, next sentence.
        lst = [sentences[i-1], sentence, sentences[i+1]]
        for j in range(len(lst)):
            sen = lst[j]
            for char in sen:
                try:
                    int(char)
                    print(f'Found {char} in "{sen}", index {j}')
                except ValueError:
                    pass
Note that you will have to make some modifications to capture multi-digit numbers. This will just print something for each digit in the full number (i.e. it will print a statement for 1, 0, and 0 if it finds 100 in the file). You will also need to handle the two edge cases where the D&O insurance substring is found in the first or last sentence. In the code above, a match in the first sentence makes i-1 equal to -1, which silently wraps around to the last sentence, and a match in the last sentence raises an IndexError because there is no sentence after it.
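For comparison, here is a shorter sketch of the same idea that uses re.split for the sentences and clamps the window at the start of the list. This is not the code above rewritten, just one possible alternative; the names sentences_around and numbers_in and the window parameter are made up here, and splitting on ., ! or ? followed by whitespace will mis-split abbreviations such as "Mr. Smith".

```python
import re

def sentences_around(text, keyword, window=1):
    """Return, for each sentence containing keyword, that sentence plus
    up to `window` sentences on either side."""
    # Split after ., ! or ? when followed by whitespace (naive assumption).
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    hits = []
    for i, sentence in enumerate(sentences):
        if keyword in sentence:
            start = max(0, i - window)  # avoid wrapping around to the end
            hits.append(sentences[start:i + window + 1])
    return hits

def numbers_in(sentence):
    """Return whole numbers (with optional decimals) found in a sentence."""
    return re.findall(r'\d+(?:\.\d+)?', sentence)

text = ("We paid 100 dollars. D&O insurance grants cover. "
        "It costs 2.5 million? Fine.")
for group in sentences_around(text, 'D&O insurance'):
    for sentence in group:
        print(sentence, '->', numbers_in(sentence))
```

Passing window=2 would give the two sentences before and after that the question asks about.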

How to extract string that contains specific characters in Python

I'm trying to extract ONLY the one string that contains the $ character. The input is based on output that I extracted using BeautifulSoup.
Code
price = [m.split() for m in re.findall(r"\w+/$(?:\s+\w+/$)*", soup_content.find('blockquote', { "class": "postcontent restore" }).text)]
Input
For Sale is my Tag Heuer Carrera Calibre 6 with box and papers and extras.
39mm
47 ish lug to lug
19mm in between lugs
Pretty thin but not sure exact height. Likely around 12mm (maybe less)
I've owned it for about 2 years. I absolutely love the case on this watch. It fits my wrist and sits better than any other watch I've ever owned. I'm selling because I need cash and other pieces have more sentimental value
I am the second owner, but the first barely wore it.
It comes with barely worn blue leather strap, extra suede strap that matches just about perfectly and I'll include a blue Barton Band Elite Silicone.
I also purchased an OEM bracelet that I personally think takes the watch to a new level. This model never came with a bracelet and it was several hundred $ to purchase after the fact.
The watch was worn in rotation and never dropped or knocked around.
The watch does have hairlines, but they nearly all superficial. A bit of time with a cape cod cloth would take care of a lot it them. The pics show the imperfections in at "worst" possible angle to show the nature of scratches.
The bracelet has a few desk diving marks, but all in all, the watch and bracelet are in very good shape.
Asking $2000 obo. PayPal shipped. CONUS.
It's a big hard to compare with others for sale as this one includes the bracelet.
The output should be like this.
2000

You don't need a regex. Instead you can iterate over lines and over each word to check for starting with '$' and extract the word:
[word[1:] for line in s.split('\n') for word in line.split() if word.startswith('$') and len(word) > 1]
where s is your paragraph.
which outputs:
['2000']

Since this is very simple you don't need a regex solution; this should suffice:
words = text.split()
words_with_dollar = [word for word in words if '$' in word]
print(words_with_dollar)
>>> ['$', '$2000']
If you don't want the dollar sign alone, you can add a filter like this:
words_with_dollar = [word for word in words if '$' in word and '$' != word]
print(words_with_dollar)
>>> ['$2000']

I would do something like this (provided input is the string you wrote above):
price_start = input.find('$')
price = input[price_start:].split(' ')[0]
This works if there is only 1 occurrence of $, like you said.
Alternatively, you could use a regex like this:
price = re.findall(r'\S*\$\S*\d', input)[0]
price = price.replace('$', '')
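If the $ can also appear on its own, as in "several hundred $ to purchase" in the question's input, a capturing group that requires digits right after the $ avoids the post-processing step. This is a sketch under that assumption; the name PRICE and the shortened text are mine.

```python
import re

# "$" followed by digits, optional comma-separated thousands, optional decimals.
# Only the part after the "$" is captured.
PRICE = re.compile(r'\$(\d+(?:,\d{3})*(?:\.\d+)?)')

# Two sentences taken from the input in the question.
text = ("This model never came with a bracelet and it was several hundred $ "
        "to purchase after the fact. Asking $2000 obo. PayPal shipped. CONUS.")

prices = PRICE.findall(text)
print(prices)  # ['2000']
```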

Extracting numerical values from a string with at most 6 digits with optional 2 digits for decimal

I have a task from which I need to extract values from a text that represent numerical values. However I am interested in extracting values that have at most 6 digits with decimal being optional.
For example, from the below text:
Total compensation for Mr. XYZ was $5,123,456 and other salary which was $650,000 in fiscal 2018, was determined to be approximately 8.78 times the median annual compensation for all of the firm's other employees, which was approximately $74,000. Some other salaries are 56000.
I need to extract
["650,000", "2018", "8.78", "74,000", "56000"]
from this.
The regex I am using:
((\d{1,3})(?:,[0-9]{3}){0,1}|(\d{1,6}))(\.\d{1,2})?
It is correctly identifying 650,000 and 74,000 but doesn't identify others correctly.
I found this 7 digit money regex and worked around it to make one for 6 digit but wasn't successful. How do I correct my regex?

Try this : (?<![\d,.])(?:\d,?){0,5}\d(?:\.\d+)?(?!,?\d)
Here's a detailed explanation:
(?x) # flag for readable mode, whitespace and comments are ignored
# Make sure not to start in the middle of a number, so no digit, comma or dot before the match
(?<![\d,.])
# k-1 digits, with an optional comma after each. For the sake of simplicity this also allows 5,4,3,2, so be aware of that
(?:\d,?){0,5}
# The kth digit
\d
# Optional dot and decimal part
(?:\.\d+)?
# Make sure not to stop in the middle of a big number, so no digit after. A comma is allowed, but only as the grammatical comma, so comma+digit is forbidden
(?!,?\d)
There could be improvement, but I think it's what you wanted. There might be some cases not handled, tell me if you find some.
Test it here : https://regex101.com/r/Wxi5Sj/2
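For anyone who wants to try it from Python rather than regex101, the same pattern can be compiled with re.VERBOSE so the comments stay next to each part. The variable names below are just for this sketch.

```python
import re

NUMBER = re.compile(r"""
    (?<![\d,.])      # not preceded by a digit, comma or dot
    (?:\d,?){0,5}    # up to five digits, each optionally followed by a comma
    \d               # the last digit of the integer part
    (?:\.\d+)?       # optional decimal part
    (?!,?\d)         # not followed by a digit or by comma+digit
""", re.VERBOSE)

text = ("Total compensation for Mr. XYZ was $5,123,456 and other salary which "
        "was $650,000 in fiscal 2018, was determined to be approximately 8.78 "
        "times the median annual compensation for all of the firm's other "
        "employees, which was approximately $74,000. Some other salaries are 56000.")

found = NUMBER.findall(text)
print(found)  # ['650,000', '2018', '8.78', '74,000', '56000']
```

The seven-digit 5,123,456 is rejected as a whole because neither lookaround lets a match start or stop inside it.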

Try the code below:
import re
input = "Total compensation for Mr. XYZ was $5,123,456 and other salary which was $650,000 in fiscal 2018, was determined to be approximately 8.78 times the median annual compensation for all of the firm's other employees, which was approximately $74,000. Some other salaries are 56000. "
print(re.findall(r'(?<=\s)\$?\d{0,3}\,?\d{1,3}(?:\.\d{2})?(?!,?\d)', input))
Output
['$650,000', '2018', '8.78', '$74,000', '56000']

getting words between m and n characters

I am trying to get all names that start with a capital letter and ends with a full-stop on the same line where the number of characters are between 3 and 5
My text is as follows:
King. Great happinesse
Rosse. That now Sweno, the Norwayes King,
Craues composition:
Nor would we deigne him buriall of his men,
Till he disbursed, at Saint Colmes ynch,
Ten thousand Dollars, to our generall vse
King. No more that Thane of Cawdor shall deceiue
Our Bosome interest: Goe pronounce his present death,
And with his former Title greet Macbeth
Rosse. Ile see it done
King. What he hath lost, Noble Macbeth hath wonne.
I am testing it out on this link. I am trying to get all words between 3 and 5 but haven't succeeded.

Does this produce your desired output?
import re
re.findall(r'[A-Z].{2,4}\.', text)
When text contains the text in your question it will produce this output:
['King.', 'Rosse.', 'King.', 'Rosse.', 'King.']
The regex pattern matches between 2 and 4 characters of any kind following an initial capital letter, then a literal dot. You can tighten that up if required, e.g. using [a-z] in the pattern: [A-Z][a-z]{2,4}\. would match an upper case character followed by between 2 and 4 lowercase characters followed by a literal dot/period.
If you don't want duplicates you can use a set to get rid of them:
>>> set(re.findall(r'[A-Z].{2,4}\.', text))
set(['Rosse.', 'King.'])
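If the names you are after are always the speaker labels at the start of a line, as they appear to be in the sample, anchoring the tightened pattern with re.MULTILINE is one way to rule out matches in the middle of a line. This is a sketch under that assumption, using a few lines from the question; the name SPEAKER is invented here.

```python
import re

# Capital letter plus 2 to 4 lowercase letters (3 to 5 in total),
# at the start of a line, followed by a full stop.
SPEAKER = re.compile(r'^([A-Z][a-z]{2,4})\.', re.MULTILINE)

text = """King. Great happinesse
Rosse. That now Sweno, the Norwayes King,
Craues composition:
King. What he hath lost, Noble Macbeth hath wonne."""

names = SPEAKER.findall(text)
print(names)               # ['King', 'Rosse', 'King']
print(sorted(set(names)))  # ['King', 'Rosse']
```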

You may have your own reasons for wanting to use regexs here, but Python provides a rich set of string methods and (IMO) it's easier to understand the code using these:
matched_words = []
for line in open('text.txt'):
    words = line.split()
    for word in words:
        # Capitalized, ends with a full stop, and 3 to 5 characters before the stop.
        if word[0].isupper() and word[-1] == '.' and 3 <= len(word)-1 <= 5:
            matched_words.append(word)
print(matched_words)
