Extract data from a table-like structure using python

I would like to extract a numeric or alphanumeric value from a table-like structure.
This table-like structure may contain rubbish or unorganized data.
For example:
''' 5. Item | 6.Marks and 7. Numberand kind of packages; 8. Ori 9. Quantity (Gross weight or 10. Invoice
number ` numbers on description of goods including Conferring other measurement), and number(s)
packages HS Code (6 digits) and brand Criterion (see value (FOB) where RVC is and date of cnaommep(ainyf apipslsiucianbglet)h.irNdapmaertoyf Overleaf Notes) appppilied (see.Overilseaaff NoNtoteess)), minvvooice(s)
invoice UF applicable)
91501937'''
The goal is to get the number under the invoice field,
which is 91501937.
This text is the output of an OCR engine, and I also have the locations of the text.
This is how it looks in the searchable PDF format.
The problem is that my regex did not work. I also tried tabula, but tabula treats this structure as rubbish.
I tried a regex like re.search(r'(invvooice(s)).*(\d+)',first_string,re.DOTALL), but I am not comfortable with regex and could not get anything useful.

Took me a while, but I figured it out at last. I wrote this code assuming the invoice number would always come last, but it shouldn't be hard to edit it so that it works for other positions as well.
Here is my solution:
x = "5. Item | 6.Marks and 7. Numberand kind of packages; 8. Ori 9. Quantity (Gross weight or 10. Invoice number ` numbers on description of goods including Conferring other measurement), and number(s) packages HS Code (6 digits) and brand Criterion (see value (FOB) where RVC is and date of cnaommep(ainyf apipslsiucianbglet)h.irNdapmaertoyf Overleaf Notes) appppilied (see.Overilseaaff NoNtoteess)), minvvooice(s) invoice UF applicable) 91501937"
a = x.lower()
words = a.split()
wordlist = []
for word in words:
    wordlist.append(word)
number = 0
for n in a:
    try:
        print('word number %d: %s' % (number, wordlist[number]))
        number = number + 1
    except IndexError:
        break
print('here is your number: %s' % (wordlist[-1]))
Edit: You don't need the "for n in a" part of the code; it is only there for tracking my progress.
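If the invoice number is not guaranteed to be the very last word, one possible variation is to keep only the tokens made up purely of digits and take the last of them. This is a minimal sketch, assuming the number is separated from its neighbours by whitespace; the helper name last_numeric_token and the shortened sample string are made up for illustration.

```python
def last_numeric_token(text):
    """Return the last whitespace-separated token made only of digits, or None."""
    numeric = [tok for tok in text.split() if tok.isdigit()]
    return numeric[-1] if numeric else None


# Shortened, hypothetical excerpt of the OCR output from the question
ocr_text = ("9. Quantity (Gross weight or 10. Invoice packages HS Code "
            "(6 digits) invoice UF applicable) 91501937")
print(last_numeric_token(ocr_text))
```

Tokens such as "10." or "(6" are skipped because they contain punctuation, so only the standalone number survives the filter.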

Related

In odoo How to Add Amount in Words / Text to Printed Invoice?

I want to print the total amount in words on an invoice generated by Odoo.
Note that I want to convert an Indian rupee (INR) amount to text.
Example:
INR 1500
desired output: one thousand five hundred

You have to create a computed field and convert the amount to words in Python. An example is provided below:
num_word = fields.Char(string="Amount In Words", compute='_compute_amount_in_word')

def _compute_amount_in_word(self):
    # Fill the computed field for every record in the recordset
    for rec in self:
        rec.num_word = str(rec.currency_id.amount_to_text(rec.amount_total)) + ' only'
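To make the "amount to words" step concrete outside Odoo, here is a minimal standalone sketch. It is not how amount_to_text is implemented; int_to_words is a hypothetical helper that handles whole numbers below one million and uses plain thousands grouping (not lakh/crore).

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def int_to_words(n):
    """Spell out a whole number 0 <= n < 1_000_000 in English words."""
    if not 0 <= n < 1_000_000:
        raise ValueError("only 0 <= n < 1_000_000 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[rest] if rest else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + int_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return int_to_words(thousands) + " thousand" + (" " + int_to_words(rest) if rest else "")


print(int_to_words(1500))
```

For the amount in the question, 1500, this prints "one thousand five hundred".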

Check if there are numbers around a keyword in a text file

I have a text file 'Filter.txt' which contains the keyword 'D&O insurance'. I would like to check whether there are numbers in the sentence that contains that keyword, as well as in the 2 sentences before and after it.
For example, I have a long paragraph like this:
"International insurance programs necessary for companies with global subsidiaries and offices. Coverage is usually for current, future and past directors and officers of a company and its subsidiaries. D&O insurance grants cover on a claims-made basis. How much is enough? What and who is covered – and not covered? "
The target word is "D&O insurance." If I wanted to extract the target sentence (D&O insurance grants cover on a claims-made basis.) as well as the preceding and following sentences (Coverage is usually for current, future and past directors and officers of a company and its subsidiaries. and How much is enough?), what would be a good approach?
This is what I have tried so far. However, I don't know how to check the whole sentence and the ones around it.
import re
for line in open('Filter.txt'):
    match = re.search(r'D&O insurance(\d+)', line)
    if match:
        print(match.group(1))
I'm new to programming, so I'm looking for possible solutions.
Thank you for your help!

Okay, I'm going to take a stab at this. Assume string is the entire contents of your .txt file (you may need to clean the '\n's out).
You're going to want to make a list of potential sentence endings, use that list to find the index positions of the sentence endings, and then use THAT list to make a list of the sentences in the file.
string = "International insurance programs necessary for companies with global subsidiaries and offices. Coverage is usually for current, future and past directors and officers of a company and its subsidiaries. D&O insurance grants cover on a claims-made basis. How much is enough? What and who is covered – and not covered?"
endings = ['! ', '? ', '. ']

def pos_find(string):
    lst = []
    for ending in endings:
        i = string.find(ending)
        if i != -1:
            lst.append(i)
    return min(lst)

def sort_sentences(string):
    sentences = []
    while True:
        try:
            i = pos_find(string)
            sentences.append(string[0:i+1])
            string = string[i+2:]
        except ValueError:
            sentences.append(string)
            break
    return sentences

sentences = sort_sentences(string)
Once you have the list of sentences (I got a little weary here, so forgive the spaghetti code; the functionality is there), you will need to comb through that list to find characters that can be converted to integers (this is how I'm checking for numbers, but you COULD do it differently).
for i in range(len(sentences)):
    sentence = sentences[i]
    match = sentence.find('D&O insurance')
    print(match)
    if match >= 0:
        # Previous, current and next sentence
        lst = [sentences[i-1], sentence, sentences[i+1]]
        for j in range(len(lst)):
            sen = lst[j]
            for char in sen:
                try:
                    int(char)
                    print(f'Found {char} in "{sen}", index {j}')
                except ValueError:
                    pass
Note that you will have to make some modifications to capture multi-digit numbers. This will just print something for each digit in the full number (i.e. it will print a statement for 1, 0, and 0 if it finds 100 in the file). You will also need to handle the two edge cases where the D&O insurance substring is found in the first or last sentence. In the code above, a match in the first sentence makes the i-1 lookup wrap around to the last sentence (negative indexes count from the end in Python), and a match in the last sentence raises an IndexError because there is no following sentence.
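The two caveats above (multi-digit numbers and the first/last sentence edge cases) can be handled together. The following is a minimal sketch, not the original answer's method: it swaps in re.split for the sentence splitting and re.findall(r'\d+') for the number detection. The function names, the window parameter and the sample text are made up for illustration.

```python
import re


def sentences_around(text, keyword, window=1):
    """For each sentence containing keyword, return it plus up to `window`
    sentences on each side, clamped at the start and end of the text."""
    # Split after '.', '!' or '?' when followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    groups = []
    for i, sentence in enumerate(sentences):
        if keyword in sentence:
            start = max(0, i - window)  # never wrap around to the end
            groups.append(sentences[start:i + window + 1])
    return groups


def numbers_in(sentences):
    """Collect whole numbers (runs of digits) from a list of sentences."""
    return [num for s in sentences for num in re.findall(r'\d+', s)]


# Made-up sample text for illustration
text = ("Coverage applies to 12 directors. D&O insurance grants cover on a "
        "claims-made basis. The limit is 100 million? Who is covered?")
for group in sentences_around(text, "D&O insurance"):
    print(group, numbers_in(group))
```

Passing window=2 would cover the "2 sentences before and after" requirement from the question; slicing never raises at the list boundaries, so no special casing is needed.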

How can I extract numbers based on context of the sentence in python?

I tried using regular expressions, but they don't take any context into account.
Examples:
"250 kg Oranges for Sale"
"I want to sell 100kg of Onions at 100 per kg"

You can do something like this.
First, split the text into words, then try to convert each word to a number.
If the conversion succeeds, the word is a number. If you are sure that a quantity is always followed by the word "kg", you can then test whether the next word is "kg".
Depending on the result, add the value to the respective list.
In this particular case, you have to make sure the numbers stand alone (e.g. "100 kg" and not "100kg"); otherwise they will not be converted.
string = "250 kg Oranges for Sale. I want to sell 100 kg of Onions at 100 per kg."
# Split the text
words_list = string.split(" ")
print(words_list)
# Find which words are numbers
quantity_array = []
price_array = []
for i in range(len(words_list)):
    try:
        number = int(words_list[i])
        # Is it a price or a quantity? (guard against a number being the last word)
        if i + 1 < len(words_list) and words_list[i + 1] == 'kg':
            quantity_array.append(number)
        else:
            price_array.append(number)
    except ValueError:
        print("\'%s\' is not a number" % words_list[i])
# Get the results
print(quantity_array)
print(price_array)
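The word-splitting approach above needs "100 kg" to be written with a space. A regex-based variation can accept both "100 kg" and "100kg". This is a minimal sketch under the same assumption as the answer (a number followed by "kg" is a quantity, any other number is a price); split_quantities_and_prices is a hypothetical helper name.

```python
import re


def split_quantities_and_prices(text):
    """Classify each number as a quantity (followed by 'kg') or a price."""
    quantities, prices = [], []
    # A run of digits, optionally followed by "kg" with or without a space
    for match in re.finditer(r'(\d+)(\s*kg\b)?', text):
        value = int(match.group(1))
        if match.group(2):
            quantities.append(value)
        else:
            prices.append(value)
    return quantities, prices


print(split_quantities_and_prices("250 kg Oranges for Sale"))
print(split_quantities_and_prices("I want to sell 100kg of Onions at 100 per kg"))
```

In the second example the first 100 is classified as a quantity and the second as a price, because only the first is directly followed by "kg".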

How to extract specific information from multi-line string

I have extracted some invoice-related information from email bodies into Python strings; my next task is to extract the invoice numbers from those strings.
The format of the emails can vary, so it is difficult to find the invoice number in the text. I also tried "Named Entity Recognition" from SpaCy, but since in most cases the invoice number appears on the line after the heading 'Invoice' or 'Invoice#', the NER doesn't understand the relation and returns incorrect details.
Below are 2 examples of the text extracted from the mail body:
Example - 1.
Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.
Example - 2.
Hi - please confirm the status of below two invoices.
Invoice# Amount Invoice Date Due Date
7651234 $19,579.06 29-Jan-19 28-Apr-19
9872341 $47,137.20 27-Feb-19 26-Apr-19
My problem is that if I convert this entire text to a single string, it becomes something like this:
Invoice Date Purchase Order Due Date Balance 8754321 8/17/17
7200016508 9/16/18 140.72
As you can see, the invoice number (8754321 in this case) has changed its position and no longer follows the keyword "Invoice", which makes it more difficult to find.
My desired output is something like this:
Output Example - 1 -
8754321
5245344
Output Example - 2 -
7651234
9872341
I don't know how to retrieve the text just under the keyword "Invoice" or "Invoice#", which is the invoice number.
Please let me know if further information is required. Thanks!
Edit: The invoice number doesn't have a pre-defined length; it can be 7 digits or more.

Code per my comments.
email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''
index = -1
# Get first line of table, print line and index of 'Invoice'
for line in email.split('\n'):
    if all(x != x.lower() for x in line.split()) and ('Invoice' in line) and len(line) > 0:
        print('--->', line, ' --- index of Invoice:', line.find('Invoice'))
        index = line.find('Invoice')
This uses the heuristic that the column header row is always in title case or capitals (e.g. ID). It would fail if, say, a heading was exactly 'Account no.' rather than 'Account No.'
# Get all numbers at that index
for line in email.split('\n'):
    words = line[index:].split()
    if words == []: continue
    word = words[0]
    try:
        print(int(word))
    except ValueError:
        continue
Reliability here depends on the data. In my code, the Invoice column must be the first column of the table header, i.e. you can't have 'Invoice Date' before 'Invoice'. Obviously this would need fixing.
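The same idea can be written as a small function that works line by line: find the header row, then collect the first token of each following row while it is numeric. This is a minimal sketch, not the answer's exact code, and it keeps the same limitation (the invoice column must come first). The name invoice_numbers is hypothetical.

```python
def invoice_numbers(email_text):
    """Collect the leading numeric token of each row under an 'Invoice' header."""
    numbers = []
    in_table = False
    for line in email_text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0].startswith("Invoice"):
            # Header row found (matches 'Invoice' and 'Invoice#')
            in_table = True
            continue
        if in_table:
            if tokens[0].isdigit():
                numbers.append(tokens[0])
            else:
                # First non-numeric row ends the table
                in_table = False
    return numbers


email2 = """Hi - please confirm the status of below two invoices.
Invoice# Amount Invoice Date Due Date
7651234 $19,579.06 29-Jan-19 28-Apr-19
9872341 $47,137.20 27-Feb-19 26-Apr-19"""
print(invoice_numbers(email2))
```

Because the rows are read per line, the number keeps its position under the header and its length does not matter.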

Going off what Andrew Allen was saying, as long as these 2 assumptions hold:
Invoice numbers are always exactly 7 numerical digits
Invoice numbers are always preceded and followed by whitespace
then using regex should work. Something along the lines of:
import re
email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''
invoices = re.findall(r'\s(\d\d\d\d\d\d\d)\s', email)
invoices in this case is a list of 2 strings, ['8754321', '5245344']

Using Regex. re.findall
Ex:
import re
email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''
email2 = """Hi - please confirm the status of below two invoices.
Invoice# Amount Invoice Date Due Date
7651234 $19,579.06 29-Jan-19 28-Apr-19
9872341 $47,137.20 27-Feb-19 26-Apr-19 """
for eml in [email, email2]:
    print(re.findall(r"\b\d{7}\b", eml, flags=re.DOTALL))
Output:
['8754321', '5245344']
['7651234', '9872341']
\b - word boundary
\d{7} - exactly 7 digits
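Since the question's edit says the invoice number may be longer than 7 digits, the fixed-width pattern can be relaxed. This is a minimal sketch, assuming the invoice number is the first thing on its line and has at least 7 digits; anchoring to the line start keeps the 10-digit purchase order numbers from matching.

```python
import re

email = '''Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''

# 7 or more digits at the start of a line, after optional spaces or tabs
invoices = re.findall(r'^[ \t]*(\d{7,})\b', email, flags=re.MULTILINE)
print(invoices)
```

With re.MULTILINE, ^ matches at the start of every line rather than only at the start of the string.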

A Python regex to find soccer team fixtures in string

I am using the Requests module to access the HTML from my target website and then using Beautiful Soup to select a specific element on the website. The element in question is a table that contains the results thus far of the English Premier League 2016/2017 season. The table contains the match date, the teams involved, the full-time score and the half-time score. I want to use Python to parse the HTML of the table element and extract the fixtures listed on there. The teams are always listed as:
Team A - Team B
A team name can be 1-3 separate strings (e.g. Burnley, Manchester United, West Ham United).
My attempt so far is:
import re
teamsRegex = re.compile(r'((\w+\s)+-(\s\w+)+)')
My logic here is that the first team can be 1-3 separate strings in length and each string is always followed by a white space. Therefore, the pattern (\w+\s)+ represents a string of any length followed by a white space and can be repeated 1 or many times. The second team name will always begin with a white space following the "-" character and again can be a string of any length, repeated 1 or many times (\s\w+)+.
I'm sort of achieving the desired results but the above is not entirely correct. I am returned a list with my desired result at index 0 followed by the first string of index 0 as index 1, and the last string in index 0 as index 2.
Example string:
'Burnley - Swansea City align=center width=45> 0 - 1 align=center> (0-0)'
Regex finds:
[('Burnley - Swansea City', 'Burnley ', ' City'), ('0 - 1', '0 ', ' 1')]
I would just like it to find [('Burnley - Swansea City')]
Many thanks in anticipation of any help!

r'(?:[A-Z][a-z]*\s)+-(?:\s[A-Z][a-z]*)+'
Here you have two non-capturing groups (they start with (?:, so you get only the full match) to match the teams' names. I chose to match letters explicitly, so the expression only matches words beginning with a capital letter and excludes digits. You should change that if the teams' names can contain digits (like "BVB 09").
Depending on the HTML file's content, one could add a final lookahead (?= align) to increase specificity.
Edit:
To match up to three capitals and optional '&'s, try this:
r'(?:[A-Z&]{1,3}[a-z]*\s)+-(?:\s[A-Z&]{1,3}[a-z]*)+'
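As a quick check, both patterns from this answer can be run with re.findall. Because they contain no capturing groups, findall returns the full matches as plain strings instead of tuples. The second sample string is made up for illustration.

```python
import re

fixture_re = re.compile(r'(?:[A-Z][a-z]*\s)+-(?:\s[A-Z][a-z]*)+')
sample = 'Burnley - Swansea City align=center width=45> 0 - 1 align=center> (0-0)'
# The score "0 - 1" is skipped because each word must start with a capital letter
print(fixture_re.findall(sample))

ampersand_re = re.compile(r'(?:[A-Z&]{1,3}[a-z]*\s)+-(?:\s[A-Z&]{1,3}[a-z]*)+')
sample2 = 'West Ham United - Brighton & Hove Albion'
print(ampersand_re.findall(sample2))
```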
