Converting number in sentences to word in python

Converting number in sentences to word in python - python

I am having a text where i have some number in the sentence , i only want to convert the number to word format. How can i solve this . I have written a code for that but that does not work as i am passing text to 'function' instead of number. how do i do that
I have tried the following code.
import num2words
def convert_num_to_words(utterance):
utterance = num2words(utterance)
return utterance
transcript = "If you can call the merchant and cancelled the transaction and confirm from them that they will not take the payment the funds will automatically be credited back into your account after 24 hours as it will expire on 11/04 Gemma"
print(convert_num_to_words("transcript"))
Expected result is
"If you can call the merchant and cancelled the transaction and confirm from them that they will not take the payment the funds will automatically be credited back into your account after twenty four hours as it will expire on 11/04 Gemma"
i.e. number 24 in text should be converted to word (Twenty four)

You need to do it to every word of the string, and only if it is numeric, and also remove the quotes beside transcript, also do num2words.num2words(...) not just num2words(...):
import num2words
def convert_num_to_words(utterance):
utterance = ' '.join([num2words.num2words(i) if i.isdigit() else i for i in utterance.split()])
return utterance
transcript = "If you can call the merchant and cancelled the transaction and confirm from them that they will not take the payment the funds will automatically be credited back into your account after 24 hours as it will expire on 11/04 Gemma"
print(convert_num_to_words(transcript))

Related

In odoo How to Add Amount in Words / Text to Printed Invoice?

I want to print the total amount as text format in an invoice generated by using odoo.
Note that that I want to convert Indian rupee(INR) to text format.
example:
INR 1500
desired output: one thousand five hundred

You have to create a compute field and convert the amount to words in python.... below i have provided a example:
num_word = fields.Char(string="Amount In Words", compute='_compute_amount_in_word')
def _compute_amount_in_word(self):
for rec in self:
rec.num_word = str(rec.currency_id.amount_to_text(rec.amount_total)) + ' only'

How to remove duplicated words in csv rows in python?

I am working with csv file and I have many rows that contain duplicated words and I want to remove any duplicates (I also don't want to lose the order of the sentences).
csv file example (userID and description are the columns name):
userID, description
12, hello world hello world
13, I will keep the 2000 followers same I will keep the 2000 followers same
14, I paid $2000 to the car I paid $2000 to the car I paid $2000 to the car
.
.
I would like to have the output as:
userID, description
12, hello world
13, I will keep the 2000 followers same
14, I paid $2000 to the car
.
.
I already tried the post such as 1 2 3 but none of them fixed my problem and did not change anything. (Order for my output file matters, since I don't want to lose the orders). It would be great if you can provide your help with a code sample that I can run in my side and learn.
Thank you
[I am using python 3.7 version]

To remove duplicates, I'd suggest a solution involving the OrderedDict data structure:
df['Desired'] = (df['Current'].str.split()
.apply(lambda x: OrderedDict.fromkeys(x).keys())
.str.join(' '))

The code below works for me:
a = pd.Series(["hello world hello world",
"I will keep the 2000 followers same I will keep the 2000 followers same",
"I paid $2000 to the car I paid $2000 to the car I paid $2000 to the car"])
a.apply(lambda x: " ".join([w for i, w in enumerate(x.split()) if x.split().index(w) == i]))
Basically the idea is to, for each word, only keep it if its position is the first in the list (splitted from string using space). That means, if the word occurred the second (or more) time, the .index() function will return an index smaller than the position of current occurrence, and thus will be eliminated.
This will give you:
0 hello world
1 I will keep the 2000 followers same
2 I paid $2000 to the car
dtype: object

Solution taken from here:
def principal_period(s):
i = (s+s).find(s, 1)
return s[:i]
df['description'].apply(principal_period)
Output:
0 hello world
1 I will keep the 2000 followers the same
2 I paid $2000 to the car
Name: description, dtype: object
Since this uses apply on string, it might be slow.

Answer taken from How can I tell if a string repeats itself in Python?
import pandas as pd
def principal_period(s):
s+=' '
i = (s + s).find(s, 1, -1)
return None if i == -1 else s[:i]
df=pd.read_csv(r'path\to\filename_in.csv')
df['description'].apply(principal_period)
df.to_csv(r'output\path\filename_out.csv')
Explanation:
I have added a space at the end to account for that the repeating strings are delimited by space. Then it looks for second occurring string (minus first and last character to avoid matching first, and last when there are no repeating strings, respectively) when the string is added to itself. This efficiently finds the position of string where the second occuring string starts, or the first shortest repeating string ends. Then this repeating string is returned.

How to extract specific information from multi-line string

I have extracted some invoice related information from email body to Python strings, my next task is to extract the Invoice numbers from the string.
The format of emails could vary, hence it is getting difficult to find invoice number from the text. I also tried "Named Entity Recognition" from SpaCy but since in most of the cases the Invoice number is coming in next line from the heading 'Invoice' or 'Invoice#',the NER doesn't understand the relation and returns incorrect details.
Below are 2 examples of the text extracted from mail body:
Example - 1.
Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.
Example - 2.
Hi - please confirm the status of below two invoices.
Invoice# Amount Invoice Date Due Date
7651234 $19,579.06 29-Jan-19 28-Apr-19
9872341 $47,137.20 27-Feb-19 26-Apr-19
My problem is that if I convert this entire text to a single string then this becomes something like this:
Invoice Date Purchase Order Due Date Balance 8754321 8/17/17
7200016508 9/16/18 140.72
As it is visible that the Invoice number (8754321 in this case) changed its position and doesn't follow the keyword "Invoice" anymore, which is more difficult to find.
My desired output is something like this:
Output Example - 1 -
8754321
5245344
Output Example - 2 -
7651234
9872341
I don't know how can I retrieve text just under keyword "Invoice" or "Invoice#" which is the invoice number.
Please let me know if further information is required. Thanks!!
Edit: The invoice number doesn't have any pre-defined length, it can be 7 digit or can be more than that.

Code per my comments.
email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''
index = -1
# Get first line of table, print line and index of 'Invoice'
for line in email.split('\n'):
if all(x != x.lower() for x in line.split()) and ('Invoice' in line) and len(line) > 0:
print('--->', line, ' --- index of Invoice:', line.find('Invoice'))
index = line.find('Invoice')
Uses heuristic that the column header row is always camel case or capitals (ID). This would fail if say a heading was exactly 'Account no.' rather than 'Account No.'
# get all number at a certain index
for line in email.split('\n'):
words = line[index:].split()
if words == []: continue
word = words[0]
try:
print(int(word))
except:
continue
Reliability here depends on data. So in my code Invoice column must be first of table header. i.e. you can't have 'Invoice Date' before 'Invoice'. Obviously this would need fixing.

Going off what Andrew Allen was saying, as long as these 2 assumptions are true:
Invoice numbers are always exactly 7 numerical digits
Invoice numbers are always following a whitespace and followed by a whitespace
Using regex should work. Something along the lines of;
import re
email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''
invoices = re.findall(r'\s(\d\d\d\d\d\d\d)\s', email)
invoice in this case has a list of 2 strings, ['8754321', '5245344']

Using Regex. re.findall
Ex:
import re
email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''
email2 = """Hi - please confirm the status of below two invoices.
Invoice# Amount Invoice Date Due Date
7651234 $19,579.06 29-Jan-19 28-Apr-19
9872341 $47,137.20 27-Feb-19 26-Apr-19 """
for eml in [email, email2]:
print(re.findall(r"\b\d{7}\b", eml, flags=re.DOTALL))
Output:
['8754321', '5245344']
['7651234', '9872341']
\b - regex boundaries
\d{7} - get 7 digit number

How to extract specific information from emails using machine learning?

I have multiple emails with a list of stock, price and quantity. Each day, the list is formatted a little differently and I was hoping to use NLP to try to understand read in the data and reformat it to show the information in a correct format.
Here is a sample of the emails I receive:
Symbol Quantity Rate
AAPL 16 104
MSFT 8.3k 56.24
GS 34 103.1
RM 3,400 -10
APRN 6k 11
NP 14,000 -44
As we can see, the quantity is in varying formats, the ticker always is standard but the rate is either positive or negative or could have decimals. Another issue is that the headers are not always the same so that is not an identifier that I can rely on.
So far I've seen some examples online where this works for names but I am unable to implement this for stock ticker, quantity and price. The code I've tried so far is below:
import re
import nltk
from nltk.corpus import stopwords
stop = stopwords.words('english')
string = """
To: "Anna Jones" <anna.jones#mm.com>
From: James B.
Hey,
This week has been crazy. Attached is my report on IBM. Can you give it a quick read and provide some feedback.
Also, make sure you reach out to Claire (claire#xyz.com).
You're the best.
Cheers,
George W.
212-555-1234
"""
def extract_phone_numbers(string):
r = re.compile(r'(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})')
phone_numbers = r.findall(string)
return [re.sub(r'\D', '', number) for number in phone_numbers]
def extract_email_addresses(string):
r = re.compile(r'[\w\.-]+#[\w\.-]+')
return r.findall(string)
def ie_preprocess(document):
document = ' '.join([i for i in document.split() if i not in stop])
sentences = nltk.sent_tokenize(document)
sentences = [nltk.word_tokenize(sent) for sent in sentences]
sentences = [nltk.pos_tag(sent) for sent in sentences]
return sentences
def extract_names(document):
names = []
sentences = ie_preprocess(document)
for tagged_sentence in sentences:
for chunk in nltk.ne_chunk(tagged_sentence):
if type(chunk) == nltk.tree.Tree:
if chunk.label() == 'PERSON':
names.append(' '.join([c[0] for c in chunk]))
return names
if __name__ == '__main__':
numbers = extract_phone_numbers(string)
emails = extract_email_addresses(string)
names = extract_names(string)
print(numbers)
print(emails)
print(names)
This code does a good job with numbers, emails and names but I am unable to replicate this for the example I have and do not really know how to go about it. Any tips will be more than helpful.

You can construct the regexes that will check for numbers and amounts.
For the sticks however, you will have to do something differently. I suspect that the stock names are not always written in uppercase letters in email. If they are then just write a script that will utilize an API from some of the stock exchanges and run only the words that have all the letters in the uppercase form. But, if the stock names are not written in uppercase letters in the emails, you can do several things. You can check every word from the email against that stock exchange if it's a stick name. If you want to speed up that process, you can try doing dependency parsing and run only the nouns or pronouns against the API.

How do I print the next item in a list that starts with a certain letter/symbol?

So I am making this troubleshooting system related to phones in Python that needs to give a solution after looking at the user's query. In my list, the keywords that will be matched to the user's query are first put in and then the relevant solution to those keywords. The problem is that different issues that the user might have have different number of keywords to choose from and if all of the solutions in my list start with a symbol such as { how can I write a code that prints the next item in the list starting with { ?
storage = ["wet", "water", "toilet", "{Wipe the phone and place it in a bag of rice for 24 hours.}"]
This is an example of the list that I made. The user's query is: "I dropped my phone in the toilet."
The solution to this problem is right after the word 'toilet' starting with a curly bracket. Can you please provide me with a code that will make the program print the next value in the list that starts with a curly bracket?

Given a starting point like:
query = "I dropped my phone in the toilet."
storage = ["wet", "water", "toilet", "{Wipe the phone and place it in a bag of rice for 24 hours.}"]
First we would need some kind of flag that indicates when we have found a matching keyword, lets say:
found_word = False
then we just iterate over storage like so:
for word in storage:
however not all the entries in storage are keywords, some are special values that need to be treated differently:
for word in storage:
if word.startswith("{"):
When we encounter a value like this, if we have found a keyword we want to print out this special value then stop looping:
if word.startswith("{"):
if found_word:
print(word)
break
otherwise if the keyword is in the query then we just set the flag to True:
elif word in query:
found_word = True
so our final code would be:
found_word = False
for word in storage:
if word.startswith("{"):
if found_word:
print(word)
break
elif word in query:
found_word = True
on the other hand, if you used a dict to store your data like:
wet_solve = "Wipe the phone and place it in a bag of rice for 24 hours."
solutions = {"wet":wet_solve, "water":wet_solve, "toilet":wet_solve}
Then you would just need to check all the words in the query for one in the solutions:
for word in query.split():
if word in solutions:
print(solutions[word])

The following will do exactly what you asked: "Can you please provide me with a code that will make the program print the next value in the list that starts with a curly bracket?"
def findNextCurly(keyword):
index = storage.index(keyword) + 1
while not storage[index].startswith("{"):
index = index+1
print (storage[index][1:-1])
>>> findNextCurly('test')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 2, in findNextCurly
ValueError: 'test' is not in list
>>> findNextCurly('wet')
Wipe the phone and place it in a bag of rice for 24 hours.
... but nothing more.
The notation string[1:-1] is called string slicing and is explained in the official tutorial. Here it is used to remove the first and last characters from the string – the curly brackets.

The solution given by Rad Lexus will return any value that has a curly brace in it, i.e. it would also print hello{brace. If you only want the ones that start with a letter, try:
storage = ["wet", "water", "toilet", "{Wipe the phone and place it in a bag of rice for 24 hours.}"]
for s in storage:
if s.startswith("{"):
print(s)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.