How to extract specific information from multi-line string - python

I have extracted some invoice-related information from email bodies into Python strings. My next task is to extract the invoice numbers from those strings.
The format of the emails can vary, which makes it difficult to find the invoice number in the text. I also tried Named Entity Recognition from spaCy, but in most cases the invoice number appears on the line after the heading 'Invoice' or 'Invoice#', so the NER model does not pick up the relation and returns incorrect details.
Below are 2 examples of the text extracted from mail body:
Example - 1.
Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.
Example - 2.
Hi - please confirm the status of below two invoices.
Invoice# Amount Invoice Date Due Date
7651234 $19,579.06 29-Jan-19 28-Apr-19
9872341 $47,137.20 27-Feb-19 26-Apr-19
My problem is that if I convert the entire text to a single string, it becomes something like this:
Invoice Date Purchase Order Due Date Balance 8754321 8/17/17
7200016508 9/16/18 140.72
As you can see, the invoice number (8754321 in this case) has changed position and no longer sits directly under the keyword "Invoice", which makes it harder to find.
My desired output is something like this:
Output Example - 1 -
8754321
5245344
Output Example - 2 -
7651234
9872341
I don't know how to retrieve the text directly under the keyword "Invoice" or "Invoice#", which is the invoice number.
Please let me know if further information is required. Thanks!!
Edit: The invoice number does not have a predefined length; it can be 7 digits or more.

Code per my comments.
email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''
index = -1
# Find the table header row; print it and the index of 'Invoice' within it
for line in email.split('\n'):
    if all(x != x.lower() for x in line.split()) and ('Invoice' in line) and len(line) > 0:
        print('--->', line, ' --- index of Invoice:', line.find('Invoice'))
        index = line.find('Invoice')
This uses the heuristic that every word in the column header row is capitalised or all caps (e.g. ID). It would fail if, say, a heading was exactly 'Account no.' rather than 'Account No.'
# Get the number that starts at that index on each line
for line in email.split('\n'):
    words = line[index:].split()
    if words == []: continue
    word = words[0]
    try:
        print(int(word))
    except ValueError:
        continue
Reliability here depends on the data. In my code the Invoice column must come first in the table header, i.e. you can't have 'Invoice Date' before 'Invoice'. Obviously this would need fixing.
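A minimal sketch of the same header-based idea that also accepts 'Invoice#'. It assumes the invoice column is the first column of the table and that the table ends at the first row that does not start with a number. The function name invoice_numbers_from_table is hypothetical, not part of the code above.

```python
def invoice_numbers_from_table(text):
    """Collect the first token of each row below an 'Invoice' / 'Invoice#' header."""
    numbers = []
    in_table = False
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines without leaving the table
        if tokens[0].lower() in ("invoice", "invoice#"):
            in_table = True   # header row found; data rows should follow
        elif in_table and tokens[0].isdigit():
            numbers.append(tokens[0])
        else:
            in_table = False  # any other line ends the table


    return numbers


email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''

print(invoice_numbers_from_table(email))
```

Because it looks at rows rather than a fixed digit count, this does not depend on the invoice number being exactly 7 digits, but it will miss tables where the invoice column is not first.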

Going off what Andrew Allen was saying, as long as these 2 assumptions are true:
Invoice numbers are always exactly 7 numerical digits
Invoice numbers are always preceded and followed by whitespace
then using regex should work. Something along the lines of:
import re
email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''
invoices = re.findall(r'\s(\d\d\d\d\d\d\d)\s', email)
invoices in this case is a list of 2 strings, ['8754321', '5245344']

Using Regex. re.findall
Ex:
import re
email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''
email2 = """Hi - please confirm the status of below two invoices.
Invoice# Amount Invoice Date Due Date
7651234 $19,579.06 29-Jan-19 28-Apr-19
9872341 $47,137.20 27-Feb-19 26-Apr-19 """
for eml in [email, email2]:
    print(re.findall(r"\b\d{7}\b", eml, flags=re.DOTALL))
Output:
['8754321', '5245344']
['7651234', '9872341']
\b - word boundary
\d{7} - exactly 7 digits
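Since the question's edit says the invoice number can be longer than 7 digits, here is a hedged variation: match a run of at least 7 digits only at the start of a line. This assumes the invoice number is the first column of each table row. The function name leading_numbers and the min_digits parameter are my own, for illustration.

```python
import re


def leading_numbers(text, min_digits=7):
    """Return digit runs of at least min_digits that start a line."""
    # ^ with re.MULTILINE anchors at the start of every line;
    # [ \t]* tolerates leading indentation without crossing line breaks.
    pattern = r"^[ \t]*(\d{%d,})\b" % min_digits
    return re.findall(pattern, text, flags=re.MULTILINE)


email2 = """Hi - please confirm the status of below two invoices.
Invoice# Amount Invoice Date Due Date
7651234 $19,579.06 29-Jan-19 28-Apr-19
9872341 $47,137.20 27-Feb-19 26-Apr-19 """

print(leading_numbers(email2))
```

Anchoring at the line start is what keeps long numbers in other columns (such as a purchase order number) out of the result.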

Related

Converting number in sentences to word in python

I have some text with a number in a sentence, and I only want to convert the number to word format. How can I do this? I have written code for it, but it does not work because I am passing text to the function instead of a number. How do I do that?
I have tried the following code.
import num2words
def convert_num_to_words(utterance):
    utterance = num2words(utterance)
    return utterance
transcript = "If you can call the merchant and cancelled the transaction and confirm from them that they will not take the payment the funds will automatically be credited back into your account after 24 hours as it will expire on 11/04 Gemma"
print(convert_num_to_words("transcript"))
Expected result is
"If you can call the merchant and cancelled the transaction and confirm from them that they will not take the payment the funds will automatically be credited back into your account after twenty four hours as it will expire on 11/04 Gemma"
i.e. number 24 in text should be converted to word (Twenty four)
You need to apply it to every word of the string, and only if the word is numeric. Also remove the quotes around transcript, and call num2words.num2words(...), not just num2words(...):
import num2words
def convert_num_to_words(utterance):
    utterance = ' '.join([num2words.num2words(i) if i.isdigit() else i for i in utterance.split()])
    return utterance
transcript = "If you can call the merchant and cancelled the transaction and confirm from them that they will not take the payment the funds will automatically be credited back into your account after 24 hours as it will expire on 11/04 Gemma"
print(convert_num_to_words(transcript))
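Note that the isdigit() check above skips a number with punctuation attached, such as '24,'. A regex-based variation, as a sketch only: it replaces digit runs that are not glued to letters, other digits, or a slash, so a date like 11/04 is left alone. The names replace_numbers and to_words are hypothetical; for real use you would pass num2words.num2words as to_words. The stand-in converter below only keeps the example free of third-party imports.

```python
import re

# Digit runs not adjacent to a word character or '/'.
# Decimals such as 3.5 are not handled specially in this sketch.
STANDALONE_NUMBER = re.compile(r"(?<![\w/])\d+(?![\w/])")


def replace_numbers(text, to_words):
    """Replace each standalone integer in text with to_words(int)."""
    return STANDALONE_NUMBER.sub(lambda m: to_words(int(m.group())), text)


def demo_converter(n):
    # Stand-in for num2words.num2words, used only for this illustration
    return "<%d>" % n


print(replace_numbers("credited after 24 hours, expires on 11/04", demo_converter))
```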

How to remove duplicated words in csv rows in python?

I am working with csv file and I have many rows that contain duplicated words and I want to remove any duplicates (I also don't want to lose the order of the sentences).
csv file example (userID and description are the column names):
userID, description
12, hello world hello world
13, I will keep the 2000 followers same I will keep the 2000 followers same
14, I paid $2000 to the car I paid $2000 to the car I paid $2000 to the car
.
.
I would like to have the output as:
userID, description
12, hello world
13, I will keep the 2000 followers same
14, I paid $2000 to the car
.
.
I already tried the approaches from several related posts, but none of them fixed my problem; they did not change anything. (The order in my output file matters, since I don't want to lose the word order.) It would be great if you could provide a code sample that I can run on my side and learn from.
Thank you
[I am using python 3.7 version]
To remove duplicates, I'd suggest a solution involving the OrderedDict data structure:
from collections import OrderedDict
df['Desired'] = (df['Current'].str.split()
                 .apply(lambda x: OrderedDict.fromkeys(x).keys())
                 .str.join(' '))
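The same order-preserving idea can be sketched with only the standard library, since a plain dict keeps insertion order from Python 3.7 on. This assumes the file layout shown in the question (a space after each comma), and the helper name dedupe_words is hypothetical. Be aware that this drops every repeated word, including legitimate repeats within a sentence.

```python
import csv
import io


def dedupe_words(text):
    """Keep only the first occurrence of each word, preserving order."""
    return " ".join(dict.fromkeys(text.split()))


# io.StringIO stands in for an open CSV file in this example
sample = io.StringIO(
    "userID, description\n"
    "12, hello world hello world\n"
    "14, I paid $2000 to the car I paid $2000 to the car\n"
)

# skipinitialspace=True drops the space that follows each comma
reader = csv.DictReader(sample, skipinitialspace=True)
rows = [{**row, "description": dedupe_words(row["description"])} for row in reader]

for row in rows:
    print(row["userID"], row["description"])
```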
The code below works for me:
import pandas as pd
a = pd.Series(["hello world hello world",
               "I will keep the 2000 followers same I will keep the 2000 followers same",
               "I paid $2000 to the car I paid $2000 to the car I paid $2000 to the car"])
a.apply(lambda x: " ".join([w for i, w in enumerate(x.split()) if x.split().index(w) == i]))
The idea is to keep each word only if its position is that of its first occurrence in the list (obtained by splitting the string on spaces). If the word occurs a second (or later) time, .index() returns an index smaller than the position of the current occurrence, so that occurrence is dropped.
This will give you:
0 hello world
1 I will keep the 2000 followers same
2 I paid $2000 to the car
dtype: object
Solution taken from here:
def principal_period(s):
    i = (s+s).find(s, 1)
    return s[:i]
df['description'].apply(principal_period)
Output:
0 hello world
1 I will keep the 2000 followers same
2 I paid $2000 to the car
Name: description, dtype: object
Since this uses apply on string, it might be slow.
Answer taken from How can I tell if a string repeats itself in Python?
import pandas as pd
def principal_period(s):
    s += ' '
    i = (s + s).find(s, 1, -1)
    return None if i == -1 else s[:i]
df = pd.read_csv(r'path\to\filename_in.csv')
# assign the result back, otherwise the output file is unchanged
df['description'] = df['description'].apply(principal_period)
df.to_csv(r'output\path\filename_out.csv')
Explanation:
I have added a space at the end because the repeated strings are delimited by spaces. The string is then concatenated with itself and searched for a second occurrence of the original (the search skips the first and last characters, to avoid matching the first copy, and to avoid matching the second copy when nothing repeats). This efficiently finds the position where the second occurrence starts, which is where the shortest repeating unit ends. That repeating unit is then returned.
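Since the function above returns None when nothing repeats, a non-repeated description would be lost. A small variation, as a sketch under my own assumptions (surrounding whitespace is stripped first, and the original text is returned when there is no repetition):

```python
def principal_period(s):
    """Return the shortest space-delimited unit that repeats to form s, else s."""
    s = s.strip()
    padded = s + " "  # repeats are separated by a single space
    # Search the doubled string, skipping the trivial matches at both ends
    i = (padded + padded).find(padded, 1, -1)
    return s if i == -1 else padded[:i].strip()


for text in ["hello world hello world",
             "I paid $2000 to the car I paid $2000 to the car I paid $2000 to the car",
             "nothing repeats here"]:
    print(principal_period(text))
```

Stripping first matters because a leading space (as in a CSV with a space after each comma) would otherwise prevent the doubled string from matching.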

Extract data from a table-like structure using python

I would like to get a numeric or alphanumeric character from a table-like structure
This table-like structure may contain some rubbish data or unorganized data
For Example,
''' 5. Item | 6.Marks and 7. Numberand kind of packages; 8. Ori 9. Quantity (Gross weight or 10. Invoice
number ` numbers on description of goods including Conferring other measurement), and number(s)
packages HS Code (6 digits) and brand Criterion (see value (FOB) where RVC is and date of cnaommep(ainyf apipslsiucianbglet)h.irNdapmaertoyf Overleaf Notes) appppilied (see.Overilseaaff NoNtoteess)), minvvooice(s)
invoice UF applicable)
91501937'''
The goal is to get the number under the invoice field,
which is 91501937
This is the output from an OCR, and I have the locations.
This is how it looks in the searchable PDF format.
The problem is that my regex did not work. I tried tabula, but it treats this structure as rubbish.
I tried some regex like re.search(r'(invvooice(s)).*(\d+)',first_string,re.DOTALL), but I am not good with regex and could not get anything.
It took me a while, but I figured it out at last. I wrote this code assuming the invoice number is always the last token, but it shouldn't be hard to edit it so that it can appear in other places as well.
Here is my solution:
x = "5. Item | 6.Marks and 7. Numberand kind of packages; 8. Ori 9. Quantity (Gross weight or 10. Invoice number ` numbers on description of goods including Conferring other measurement), and number(s) packages HS Code (6 digits) and brand Criterion (see value (FOB) where RVC is and date of cnaommep(ainyf apipslsiucianbglet)h.irNdapmaertoyf Overleaf Notes) appppilied (see.Overilseaaff NoNtoteess)), minvvooice(s) invoice UF applicable) 91501937"
a = x.lower()
words = a.split()
wordlist = []
for word in words:
    wordlist.append(word)
number = 0
for n in a:
    try:
        print('word number %d: %s' %(number,wordlist[number]))
        number = number + 1
    except IndexError:
        break
print('here is your number: %s' %(wordlist[-1]))
Edit: You don't need the part of the code that starts with for n in a; it's only there for tracking my progress.
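If the invoice number is not guaranteed to be the very last token, a hedged alternative is to take the last long run of digits instead. This assumes the invoice number is at least 6 digits long and that the short numbers in the header (5., 6., 10. and so on) are noise. The function name last_long_number and the min_digits threshold are my own choices.

```python
import re


def last_long_number(text, min_digits=6):
    """Return the last standalone run of at least min_digits digits, or None."""
    matches = re.findall(r"\b\d{%d,}\b" % min_digits, text)
    return matches[-1] if matches else None


ocr_text = "9. Quantity (Gross weight or 10. Invoice number HS Code (6 digits) invoice UF applicable) 91501937"
print(last_long_number(ocr_text))
```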

Python Regex to extract codes from a string

I have a string like -
Srting = "$33.53 with 2 coupon codes : \r\n\r\n1) CODEONE\r\n\r\n2) CODETWO \r\n\r\nBoth coupons only work if you buy 1 by 1"
I want to extract the coupon codes "CODEONE" and "CODETWO" from this string if the following condition is true -
if "coupon code" in string:
Please help: how can I extract these coupon codes? I need a generic RE for this, because I may have other strings where the codes occur at different places, and it is also possible that there is only one code.
This might help.
import re
Srting = "$33.53 with 2 coupon codes : \r\n\r\n1) CODEONE\r\n\r\n2) CODETWO \r\n\r\nBoth coupons only work if you buy 1 by 1"
for i in re.findall(r"\d+\)(.*)", Srting):
    print(i.strip())
Output:
CODEONE
CODETWO
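A slightly stricter sketch that also applies the "coupon code" condition from the question. It assumes each code sits on its own line in the form "1) CODE" and contains no spaces; the function name extract_coupon_codes is hypothetical.

```python
import re


def extract_coupon_codes(text):
    """Return the codes from numbered lines like '1) CODE', if the text mentions coupon codes."""
    if "coupon code" not in text.lower():
        return []
    # With re.MULTILINE, ^ matches at the start of every line;
    # (\S+) captures the code up to the next whitespace (including \r).
    return re.findall(r"^[ \t]*\d+\)[ \t]*(\S+)", text, flags=re.MULTILINE)


Srting = "$33.53 with 2 coupon codes : \r\n\r\n1) CODEONE\r\n\r\n2) CODETWO \r\n\r\nBoth coupons only work if you buy 1 by 1"
print(extract_coupon_codes(Srting))
```

It returns a list, so a message with a single code simply yields a one-element list.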

How to remove html commands

I have a .htm document. By using text_content(), I extracted the text from the document.
Here is the text:
'PART II \xa0\r\nThe Company\x92s common stock is traded on the over-the-counter market and is quoted on the NASDAQ Global Select Market under the symbol\r\nAAPL and on the Frankfurt Stock Exchange under the symbol APCD. Price Range of Common Stock The price range per share of common stock presented below represents the highest and lowest sales prices for the Company\x92s common stock\r\non the NASDAQ Global Select Market during each quarter of the two most recent years. \xa0\r\nHolders As of October\xa016, 2009, there were 30,573 shareholders of record. Dividends\r\n The Company did not declare or pay cash dividends in either 2009 or 2008. The Company anticipates that for the foreseeable\r\nfuture it will retain any earnings for use in the operation of its business. Purchases of Equity Securities by the Issuer and Affiliated\r\nPurchasers None. \xa0\r\n 33 '
With this text, I need to remove a heading which is preceded and followed by a blank line. Thus, lines of the following form should be removed:
\n
some text here\n
\n
I have code that does that for the .txt version of the document. However, in the .htm document, I noticed that some odd sequences like \xa0\r\n are used, for example to capitalise words. Is there any way to remove all of these and correctly remove only the headings?
Here is the function that does remove the heading:
def clean_text_passage(a_text_string):
    """REMOVE /n: take a list of strings (some passage of text)
    and remove noise which is defined as lines that are preceded
    by a blank line and followed by a blank line that is lines of
    this form will not be in the output
    \n
    some text here\n
    \n
    """
    new_passage=[]
    p=[line+'\n' for line in a_text_string.split('\n')]
    passage = [w.lower().replace('</b>\n', '\n') for w in p]
    if len(passage[0].strip())>0:
        if len(passage[1].strip())>0:
            new_passage.append(passage[0])
    for counter, text_line in enumerate(passage[:-1]):
        len_line_before=len(passage[counter-1].strip())
        len_line_after=len(passage[counter+1].strip())
        if len_line_before==len_line_after==0:
            continue
        if len(text_line.strip())!=0:
            new_passage.append(text_line)
    if len(passage[-2].strip())!=0:
        if len(passage[-1].strip())!=0:
            new_passage.append(passage[-1])
    return new_passage
I guess the key is to identify the heading in htm document.
Thank you so much for your time and help.
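One possible approach, as a sketch only: normalise the odd characters first, then drop lines that have a blank line on both sides. It assumes \xa0 is a non-breaking space, \x92 is the Windows-1252 apostrophe, and that the start and end of the text count as blank lines. The function name remove_isolated_lines is hypothetical, and the sample text is a shortened illustration, not the full passage.

```python
def remove_isolated_lines(text):
    """Drop blank lines and lines that are surrounded by blank lines."""
    # Normalise non-breaking spaces, the cp1252 apostrophe, and line endings
    text = (text.replace("\xa0", " ")
                .replace("\x92", "'")
                .replace("\r\n", "\n")
                .replace("\r", "\n"))
    lines = text.split("\n")
    kept = []
    for i, line in enumerate(lines):
        if not line.strip():
            continue  # blank lines are not part of the output
        prev_blank = i == 0 or not lines[i - 1].strip()
        next_blank = i == len(lines) - 1 or not lines[i + 1].strip()
        if prev_blank and next_blank:
            continue  # heading-like line: blank before and after
        kept.append(line)
    return kept


sample = ("PART II \xa0\r\n\r\nThe Company\x92s stock is traded.\r\n"
          "It is quoted on NASDAQ.\r\n \xa0\r\nHolders\r\n\r\n"
          "There were shareholders.\r\nNone.")
print(remove_isolated_lines(sample))
```

Normalising before splitting is the key step: a line containing only \xa0 then counts as blank, so the heading test behaves the same as for the .txt version.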
