Python extract information after phrase or group of words - python

I am trying to extract information from PDF.
Simple search worked:
filecontent = ReadDoc.getContent("c:\\temp\\pdf_1.pdf")
match = re.search(r'Document ID: (\d+)', filecontent)
if match:
    docid = match.group(1)
But I am stuck when I need to search for a longer phrase.
For example, I want to extract '$999,999.00', which may appear in the document as "Total Cumulative Payment (USD) $999,999.00" or as "Total cumulative payment $55587323.23". Note that the surrounding text differs, so I need some kind of fuzzy search: find the sentence, then extract the dollar amount from it.
Similarly, I also need to extract dates, numbers and money amounts that appear between phrases or words.
I appreciate your help!

I think this should do what you want:
import re
textlist = ["some other amount as $32,4545.34 and Total Cumulative Payment (USD) $999,999.00 and such","Total cumulative payment $55587323.23"]
matchlist = []
for text in textlist:
    # Raw string so that \$ and \d reach the regex engine unchanged
    match = re.findall(r"(\$[.\d,]+)", text)
    if match:
        matchlist.extend(match)
print(matchlist)
results:
['$32,4545.34', '$999,999.00', '$55587323.23']
The regex looks for a $ and then grabs digits, periods and commas until it reaches any other character (such as a space). Depending on what other kinds of data you are parsing, it may need to be tweaked; I am assuming you only want to capture periods, commas and digits.
Update:
It will now find any number of occurrences and put them all in a list.

Well something like this can be done with regular expressions:
import re
source = 'total cumulative payment $2000.00; some other amount $1234.56. Total Cumulative Payment (USD) $5,600,000.06'
matches = re.findall( r'total\s+cumulative\s+payment[^$0-9]+\$([0-9,.]+)', source, re.IGNORECASE )
amounts = [ float( x.replace( ',', '' ).rstrip('.') ) for x in matches ]
This will match the two specific examples you've given. But you haven't given much of an idea of how loose the matching criteria should be, or what the rules are. The solution above will miss amounts if the source document has a spelling mistake in the word "cumulative", or if the amount appears without the dollar sign.
It also allows any amount of intervening text between "total cumulative payment" and the dollar amount, as long as that text contains no digits or dollar signs. So you'll get a false positive from source = "This document contains information about total cumulative payment values, (...3 more pages of introductory material...) and by the way you owe me $20."
Now, these things can be tweaked and improved, but only if you know what is going to be important and what is not, and tighten the specification of the question accordingly.
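One possible tightening, as a minimal sketch: cap the number of characters allowed between the phrase and the dollar sign. The limit of 20 (MAX_GAP) and the helper name extract_amounts are assumptions made for illustration, not something from the question; pick a limit that fits your documents.

```python
import re

# Hypothetical limit on how much text may sit between the phrase and the amount.
MAX_GAP = 20

AMOUNT_RE = re.compile(
    r'total\s+cumulative\s+payment'      # the anchor phrase
    r'[^$0-9]{0,%d}'                     # at most MAX_GAP non-digit, non-$ characters
    r'\$([0-9][0-9,]*(?:\.[0-9]+)?)'     # the amount, without the dollar sign
    % MAX_GAP,
    re.IGNORECASE,
)


def extract_amounts(text):
    """Return every amount that follows the phrase closely, as floats."""
    return [float(m.replace(',', '')) for m in AMOUNT_RE.findall(text)]


source = ('total cumulative payment $2000.00; some other amount $1234.56. '
          'Total Cumulative Payment (USD) $5,600,000.06')
print(extract_amounts(source))   # [2000.0, 5600000.06]

far_away = ('This document contains information about total cumulative payment '
            'values, plus pages of introductory material, and you owe me $20.')
print(extract_amounts(far_away))  # []
```

Because the amount must end in a digit, trailing sentence punctuation is not captured, so the rstrip('.') step is no longer needed in this variant.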

Related

Get character followed by new line item using regex

I am trying to get the character on a new line after a specific letter using regex. My raw data looks like the below:
Total current charges (please see Current account details) $38,414.69
ID Number
1001166UNBEB
ACCOUNT SUMMARY
SVL0
BALANCE OVERDUE - PLEASE PAY IMMEDIATELY $42,814.80
I want to get the ID Number
My attempt is here:
ID_num = re.compile(r'[^ID Number[\r\n]+([^\r\n]+)]{12}')
The length of ID num is always 12, and always after ID Number which is why I am specifying the length in my expression and trying to detect the elements after that.
But this is not working as desired.
Would anyone help me, please?
Your regex is not working because the whole pattern is wrapped in [ ], which defines a character set, and a leading ^ inside a set negates it.
Use ( ) for grouping instead, and apply the {12} quantifier to the character class rather than to the group.
Your pattern would then look like: r'ID Number[\r\n]+([^\r\n]{12})'
But you can simplify your pattern to: ID Number[\s]+(\w+)
\s matches \r and \n (as well as other whitespace), and \w matches digits, letters and the underscore.
import re
s = """
Total current charges (please see Current account details) $38,414.69
ID Number
1001166UNBEB
ACCOUNT SUMMARY
SVL0
BALANCE OVERDUE - PLEASE PAY IMMEDIATELY $42,814.80
"""
print(re.findall(r"ID Number[\s]+(\w+)", s))
# ['1001166UNBEB']
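If you also want to enforce the 12-character length mentioned in the question, a small variation might look like this. It is only a sketch: the name ID_PATTERN is made up here, and it assumes the ID consists of letters and digits only.

```python
import re

s = """
Total current charges (please see Current account details) $38,414.69
ID Number
1001166UNBEB
ACCOUNT SUMMARY
SVL0
BALANCE OVERDUE - PLEASE PAY IMMEDIATELY $42,814.80
"""

# \w{12} requires exactly 12 word characters; the trailing \b rejects
# longer runs, so a 13-character value would not match at all.
ID_PATTERN = re.compile(r'ID Number\s+(\w{12})\b')

match = ID_PATTERN.search(s)
id_number = match.group(1) if match else None
print(id_number)  # 1001166UNBEB
```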

Count the number of approximate string matches in text corpus using fuzzy matching

I am trying to find the number of times a string appears in a long text. When there is no exact match, I want to use fuzzy partial matching to find out whether the target string partially matches a substring in the text corpus. So far, this is what I have:
from fuzzywuzzy import fuzz
# Target name and text corpus
target_name = 'Example Company S.A.'
text = '''This is an example text, containing various
words around the company
name I actually want to find. This name is Example
company SA, and as you can
see this will not return an exact match. Here is the
name again with some modifications: example company.'''
# Get number of exact matches
n_matches = text.count(target_name) # This is 0
# Score for fuzzy matching after tokenization
fuzz.partial_token_set_ratio(target_name, text) # This is 100
My question is: is there any way to count the number of times this fuzzy matching finds the target name within the text corpus, after setting a score threshold (e.g. 85)? I am pretty new to this kind of technique and I have not found any example doing exactly this.
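One way this could be approached, sketched with the standard library's difflib instead of fuzzywuzzy (so the scores run from 0 to 1 rather than 0 to 100 and will not be identical to fuzz ratios): slide a window of as many words as the target has over the text, score each window, and count the windows that reach the threshold. The helper names normalize and count_fuzzy_matches are made up for this illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(s):
    # Lowercase and drop punctuation so that "S.A." and "SA" compare equal.
    return re.sub(r'[^a-z0-9\s]', '', s.lower())


def count_fuzzy_matches(target, text, threshold=0.85):
    """Count non-overlapping word windows similar enough to the target."""
    target_words = normalize(target).split()
    words = normalize(text).split()
    n = len(target_words)
    if n == 0:
        return 0
    target_norm = ' '.join(target_words)
    count = 0
    i = 0
    while i + n <= len(words):
        window = ' '.join(words[i:i + n])
        if SequenceMatcher(None, target_norm, window).ratio() >= threshold:
            count += 1
            i += n   # jump past the match so it is not counted twice
        else:
            i += 1
    return count


target_name = 'Example Company S.A.'
text = '''This is an example text, containing various
words around the company
name I actually want to find. This name is Example
company SA, and as you can
see this will not return an exact match. Here is the
name again with some modifications: example company.'''

print(count_fuzzy_matches(target_name, text))
```

With the 0.85 threshold only the "Example company SA" occurrence should be counted; the shortened "example company" at the end scores lower, so you would need a lower threshold (or shorter windows as well) to pick up truncated variants.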

removing characters off the end of a string

I am displaying prices of graphics cards from Newegg using web scraping. Some of the text I scrape has unwanted text after the price. What is the most efficient way to display only the price and nothing more?
price_container = container.findAll("li", {"class": "price-current"})
price = price_container[0].text
if len(price) > 7:
The prices (the bit I want to keep) are never more than 7 characters long, so I thought I could remove the unwanted text with this if statement, but I'm not sure how, because each price is followed by a different length of unwanted text.
Use a regular expression:
import re
m = re.search(r'\$([\d.]+)', price)
if m:
    print(m.group(0)) # to include the dollar sign
    print(m.group(1)) # the amount without the dollar sign
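If some prices contain thousands separators (for example "$1,299.99"), the pattern above would stop at the comma. A slightly extended sketch follows; the helper name extract_price and the sample strings are invented for illustration and are not taken from the actual Newegg markup.

```python
import re

# A dollar sign, optional whitespace, then digits with optional commas
# and an optional decimal part of one or two digits.
PRICE_RE = re.compile(r'\$\s*(\d[\d,]*(?:\.\d{1,2})?)')


def extract_price(text):
    """Return the first price in text as a float, or None if there is none."""
    m = PRICE_RE.search(text)
    if not m:
        return None
    return float(m.group(1).replace(',', ''))


print(extract_price('$1,299.99 (4 Offers)'))    # 1299.99
print(extract_price('$499.99 Sale ends soon'))  # 499.99
print(extract_price('no price here'))           # None
```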
You can either use a regular expression, or split the string and extract the numbers from it.
Example:
[float(p) for p in price.split() if p.isdigit()] # Gives you a list of the whitespace-separated tokens that consist only of digits. Note that a token such as '$499.99' is skipped, because isdigit() is False for strings containing '$' or '.'. You can then join them back together.
Perhaps not exactly what you are looking for, but hopefully it will help you :)
if len(price) > 7:
    price = price[:-1] # This reassigns price to the same string/list without its last element.

How to print the whole lines of the interface based on value using regex in python [duplicate]

This is a sample of the text I am working with.
6) Jake's Taxi Service is a new entrant to the taxi industry. It has achieved success by staking out a unique position in the industry. How did Jake's Taxi Service mostly likely achieve this position?
A) providing long-distance cab fares at a higher rate than
competitors; servicing a larger area than competitors
B) providing long-distance cab fares at a lower rate than competitors;
servicing a smaller area than competitors
C) providing long-distance cab fares at a higher rate than
competitors; servicing the same area as competitors
D) providing long-distance cab fares at a lower rate than competitors;
servicing the same area as competitors
Answer: D
I am trying to match the entire question including the answer options: everything from the question number to the word Answer.
This is my current regex:
((rf'(?<={searchCounter}\) ).*?(?=Answer).*'), re.DOTALL)
searchCounter is just a variable that corresponds to the current question, in this case 6. I think the issue has something to do with searching across the new lines.
EDIT: Full source code
searchCounter = 1
bookDict = {}
with open ('StratMasterKey.txt', 'rt') as myfile:
    for line in myfile:
        question_pattern = re.compile((rf'(?<={searchCounter}\) ).*?(?=Answer).*'), re.DOTALL)
        result = question_pattern.search(line)
        if result != None:
            bookDict[searchCounter] = result[0]
            searchCounter +=1
The reason your regex fails is that you read the file line by line with for line in myfile:, while your pattern searches for matches in a single multiline string.
Replace for line in myfile: with contents = myfile.read() and then use result = question_pattern.search(contents) to get the first match, or result = question_pattern.findall(contents) to get multiple matches.
A note on the regex: I am not fixing the whole pattern since, as you mentioned, it is out of scope of this question. But since the input is now a single multiline string, you should remove re.DOTALL, use [\s\S] where the pattern must match any char, and use . where it must match any char but a line break char. Also, the lookaround construct is redundant; you may safely replace (?=Answer) with Answer. Finally, to check if there is a match, you may simply use if result: and then grab the whole match value by accessing result.group().
Full code snippet:
import re
searchCounter = 1  # number of the question to look for
with open ('StratMasterKey.txt', 'rt') as myfile:
    contents = myfile.read()  # read the whole file as one string
question_pattern = re.compile((rf'(?<={searchCounter}\) )[\s\S]*?Answer.*'))
result = question_pattern.search(contents)
if result:
    print( result.group() )
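To collect every question in one pass rather than searching for one number at a time, something along these lines might work. It is a sketch that assumes each question starts at the beginning of a line with "N) " and ends with a line of the form "Answer: X"; the names QUESTION_RE and parse_questions and the shortened sample text are made up for illustration.

```python
import re

# Group 1: the question number. Group 2: everything up to and including
# the first following line that starts with "Answer: ".
QUESTION_RE = re.compile(
    r'^(\d+)\) (.*?^Answer: \w+)',
    re.MULTILINE | re.DOTALL,
)


def parse_questions(contents):
    """Map question number -> question text, options and answer line."""
    return {int(m.group(1)): m.group(2) for m in QUESTION_RE.finditer(contents)}


sample = """6) Jake's Taxi Service is a new entrant to the taxi industry.
A) providing long-distance cab fares at a higher rate than
competitors; servicing a larger area than competitors
D) providing long-distance cab fares at a lower rate than competitors;
servicing the same area as competitors
Answer: D
7) A second, made-up question?
A) yes
B) no
Answer: A
"""

book = parse_questions(sample)
print(sorted(book))   # [6, 7]
print(book[7])
```

With a real file you would pass the result of myfile.read() to parse_questions instead of the sample string.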

regex count occurrences

I am looking for a way to count the occurrences found in a string based on my regex. I used findall() and it returns a list, but the len() of the list is only 1. Shouldn't the len() of the list be 2?
import re
string1 = r'Total $200.00 Total $900.00'
regex = r'(.*Total.*|.*Invoice.*|.*Amount.*)?(\s+?\$\s?[1-9]{1,10}.*(?:[.,]\d{3})*(?:[.,]\d{2})?)'
patt = re.findall(regex,string1)
print(patt)
print(len(patt))
Result:
> [('Total $200.00 Total', ' $900.00')]
> 1
I am not sure if my regex is causing it to miscount. I am looking to get the Total from a file, but there are many variations of it.
Examples:
Total $900.00
Invoice Amt $500.00
Total 800.00
etc.
I am looking to count this because there could be multiple invoice details in one file.
First off, because that's a common misconception:
There is no need to match "all text up to the match" or "all the text after a match". You can drop those .* in your regex. Start with what you actually want to match.
import re
string1 = 'Total $200.00 Total $900.00'
amount_pattern = r'(?:Total|Amt|Invoice Amt|Others)[:\s]*\$([\d\.,]*\d)'
amount_expr = re.compile(amount_pattern, re.IGNORECASE)
amount_expr.findall(string1)
# -> ['200.00', '900.00']
\$([\d\.,]*\d) is a half-way reasonable approximation of prices ("things that start with a $ and then contain a bunch of digits and possibly dots and commas"). The final \d makes sure we are not accidentally matching sentence punctuation. It might be good enough, but you know what data you are working with. Feel free to come up with a more specific sub-expression. Include an optional leading - if you expect to see negative amounts.
Try:
>>> re.findall(r'(\w*\s+\$\d+\.\d+)', string1)
['Total $200.00', 'Total $900.00']
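The issue you are having is that your regex has two capture groups, so re.findall returns one tuple per match, with one element per group. Because the greedy .* in the first group swallows the first amount, the pattern matches only once, so the list contains a single tuple and its length is 1.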
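A small demonstration of how capture groups and greedy .* affect what re.findall returns. The first pattern is a deliberately simplified stand-in for the one in the question, not the original regex.

```python
import re

string1 = 'Total $200.00 Total $900.00'

# Greedy .* inside the optional first group runs up to the LAST "Total",
# so the whole string is consumed by a single match.
greedy = re.findall(r'(.*Total.*)?(\s+\$\d+\.\d+)', string1)
print(greedy, len(greedy))
# [('Total $200.00 Total', ' $900.00')] 1

# Two groups without .*: one tuple per match.
pairs = re.findall(r'(Total)\s+\$(\d+\.\d+)', string1)
print(pairs, len(pairs))
# [('Total', '200.00'), ('Total', '900.00')] 2

# A single group: findall returns plain strings.
amounts = re.findall(r'Total\s+\$(\d+\.\d+)', string1)
print(amounts, len(amounts))
# ['200.00', '900.00'] 2
```

So for counting, len() of the findall result is fine as long as the pattern matches each occurrence separately.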
