phonenumbers python module not giving correct country code

I am trying to use the phonenumbers module in Python and am stuck on the issue below: it returns the wrong country code, even though both are US phone numbers. Can someone suggest how to proceed?
import phonenumbers
print(phonenumbers.parse("+301.795.1400"))
Output: Country Code: 30 National Number: 17951400 (wrong)
print(phonenumbers.parse("+1301.795.1400"))  # correct after adding +1 (or after removing the '+')
Output: Country Code: 1 National Number: 3017951400
For example:
+44 7923 903949 gives country code +44, which is correct.
+782-205-2583 gives country code +7, which is wrong.
I expect +1 as the country code and 782-205-2583 as the phone number.

A plus ('+') means that the digits immediately following it are a country code. The country code for the US is '1' (i.e. '+1'). You are writing the plus, which tells the parser that the next digit or digits are a country code, but then omitting the country code itself.
It looks to me like the module is working correctly.
See:
https://countrycode.org/
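One way to act on this is to normalise the input before it reaches the parser, so that a bare ten-digit US number gets its '+1'. The helper below is only a sketch: the name normalize_us_number is made up for illustration, and it assumes every input is meant to be a US number, which the parser has no way of knowing.

```python
import re

def normalize_us_number(raw):
    """Return raw as '+1' followed by ten digits, assuming it is a US number.

    Hypothetical helper, not part of the phonenumbers package.
    """
    digits = re.sub(r"\D", "", raw)  # drop '+', dots, dashes and spaces
    if len(digits) == 10:
        # Ten digits: the country code was omitted, so add it.
        return "+1" + digits
    if len(digits) == 11 and digits.startswith("1"):
        # Eleven digits starting with 1: the country code is already there.
        return "+" + digits
    raise ValueError(f"not a 10-digit US number: {raw!r}")

print(normalize_us_number("+301.795.1400"))   # +13017951400
print(normalize_us_number("+782-205-2583"))   # +17822052583
print(normalize_us_number("+1301.795.1400"))  # +13017951400
```

The normalised string can then be passed to phonenumbers.parse. For numbers written without a leading '+', phonenumbers.parse also accepts a default region as its second argument, for example phonenumbers.parse("301.795.1400", "US").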

Related

Given a string, extract all the necessary information about the person

In my homework, I need to extract the first name, last name, ID code, phone number, date of birth and address of a person from a given string using Regex. The order of the parameters always remains the same. Each parameter requires a separate pattern.
Requirements are as follows:
Both first and last names always begin with a capital letter followed by at least one lowercase letter.
ID code is always 11 characters long and consists only of numbers.
The phone number itself is a combination of 7-8 numbers. The phone number might be separated from the area code with a whitespace, but not necessarily. It is also possible that there is no area code at all.
Date of birth is formatted as dd-MM-YYYY
Address is everything else that remains.
I got the following patterns for each parameter:
import re
str1 = "HeinoPlekk69712047623+3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti"
first_name_pattern = r"^[A-Z][a-z]+"
last_name_pattern = r"[A-Z][a-z]+(?=[0-9])"
id_code_pattern = r"\d{11}(?=\+)"
phone_number_pattern = r"\+\d{3}?\s*\d{7,8}"
date_pattern = r"\d{1,2}\-\d{1,2}\-\d{1,4}"
address_pattern = r"[A-Z][a-z]*\s.*$"
first_name_match = re.findall(first_name_pattern, str1)
last_name_match = re.findall(last_name_pattern, str1)
id_code_match = re.findall(id_code_pattern, str1)
phone_number_match = re.findall(phone_number_pattern, str1)
date_match = re.findall(date_pattern, str1)
address_match = re.findall(address_pattern, str1)
So, given "HeinoPlekk69712047623+3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti", I get ['Heino'] ['Plekk'] ['69712047623'] ['+37256887364'] ['12-09-2020'] ['Tartu mnt 183,Tallinn,16881,Eesti'], which suits me perfectly.
The problem starts when the area code is missing: id_code_pattern can no longer find the ID code because of the (?=\+) lookahead, and if I add |\d{11} as an alternative, it finds both the ID code and part of the phone number (69712047623 and 37256887364). I also do not understand how to improve phone_number_pattern so that it finds only the 7 or 8 digits of the phone number.

A single expression with some well-crafted capture groups will help you immensely:
import re
str1 = "HeinoPlekk69712047623+3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti"
pattern = r"^(?P<first_name>[A-Z][a-z]+)(?P<last_name>[A-Z][a-z]+)(?P<id_code>\d{11})(?P<phone>(?:\+\d{3})?\s*\d{7,8})(?P<dob>\d{1,2}\-\d{1,2}\-\d{1,4})(?P<address>.*)$"
print(re.match(pattern, str1).groupdict())
Result:
{'first_name': 'Heino', 'last_name': 'Plekk', 'id_code': '69712047623', 'phone': '+37256887364', 'dob': '12-09-2020', 'address': 'Tartu mnt 183,Tallinn,16881,Eesti'}
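Because the phone group makes the +372 prefix optional, the same expression should also cope with a missing area code. The sketch below checks that. It is a variation rather than the exact pattern above: it is split across lines with re.VERBOSE, and it tightens the date to two-two-four digits, following the dd-MM-YYYY format stated in the question, so that a 7-digit phone number cannot swallow the first digit of the day. The function name parse_person and the two shortened input strings are made up for illustration.

```python
import re

# The pattern from the answer, laid out with re.VERBOSE.
# Assumption: the date is strictly dd-MM-YYYY, as the question states.
PERSON_PATTERN = re.compile(r"""
    ^(?P<first_name>[A-Z][a-z]+)
    (?P<last_name>[A-Z][a-z]+)
    (?P<id_code>\d{11})
    (?P<phone>(?:\+\d{3})?\s*\d{7,8})
    (?P<dob>\d{2}-\d{2}-\d{4})
    (?P<address>.*)$
""", re.VERBOSE)

def parse_person(text):
    """Return the named groups as a dict, or None if the text does not match."""
    match = PERSON_PATTERN.match(text)
    return match.groupdict() if match else None

with_code = "HeinoPlekk69712047623+3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti"
without_code = "HeinoPlekk697120476235688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti"
seven_digits = "HeinoPlekk69712047623568873612-09-2020Tartu mnt 183,Tallinn,16881,Eesti"

for text in (with_code, without_code, seven_digits):
    person = parse_person(text)
    print(person["id_code"], person["phone"], person["dob"])
# 69712047623 +37256887364 12-09-2020
# 69712047623 56887364 12-09-2020
# 69712047623 5688736 12-09-2020
```

Matching everything in one pass works here because the fixed-width ID code and date pin down where the variable-width phone number must start and end.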

How to extract specific information from multi-line string

I have extracted some invoice-related information from email bodies into Python strings. My next task is to extract the invoice numbers from those strings.
The format of the emails can vary, which makes it difficult to find the invoice number in the text. I also tried Named Entity Recognition from spaCy, but in most cases the invoice number is on the line after the heading 'Invoice' or 'Invoice#', so the NER does not pick up the relation and returns incorrect details.
Below are 2 examples of the text extracted from mail body:
Example - 1.
Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.
Example - 2.
Hi - please confirm the status of below two invoices.
Invoice# Amount Invoice Date Due Date
7651234 $19,579.06 29-Jan-19 28-Apr-19
9872341 $47,137.20 27-Feb-19 26-Apr-19
My problem is that if I convert this entire text to a single string then this becomes something like this:
Invoice Date Purchase Order Due Date Balance 8754321 8/17/17
7200016508 9/16/18 140.72
As you can see, the invoice number (8754321 in this case) has changed position and no longer follows the keyword "Invoice" directly, which makes it harder to find.
My desired output is something like this:
Output Example - 1 -
8754321
5245344
Output Example - 2 -
7651234
9872341
I don't know how to retrieve the text directly under the keyword "Invoice" or "Invoice#", which is the invoice number.
Please let me know if further information is required. Thanks!
Edit: The invoice number doesn't have a pre-defined length; it can be 7 digits or more.

Code per my comments.
email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''
index = -1
# Get the first line of the table; print the line and the index of 'Invoice'
for line in email.split('\n'):
    if all(x != x.lower() for x in line.split()) and ('Invoice' in line) and len(line) > 0:
        print('--->', line, ' --- index of Invoice:', line.find('Invoice'))
        index = line.find('Invoice')
This uses the heuristic that every word in the column header row is capitalised or all capitals (e.g. ID). It would fail if, say, a heading were exactly 'Account no.' rather than 'Account No.'
# Get every number that starts at that index
for line in email.split('\n'):
    words = line[index:].split()
    if words == []: continue
    word = words[0]
    try:
        print(int(word))
    except ValueError:
        continue
Reliability here depends on the data. In my code, 'Invoice' must be the first header in the table that contains that word, i.e. you can't have 'Invoice Date' before 'Invoice'. Obviously this would need fixing.

Going off what Andrew Allen was saying, as long as these 2 assumptions are true:
Invoice numbers are always exactly 7 numerical digits
Invoice numbers are always following a whitespace and followed by a whitespace
Using regex should work. Something along the lines of:
import re
email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''
invoices = re.findall(r'\s(\d\d\d\d\d\d\d)\s', email)
invoices in this case is a list of 2 strings, ['8754321', '5245344']

Using regex with re.findall.
Example:
import re
email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''
email2 = """Hi - please confirm the status of below two invoices.
Invoice# Amount Invoice Date Due Date
7651234 $19,579.06 29-Jan-19 28-Apr-19
9872341 $47,137.20 27-Feb-19 26-Apr-19 """
for eml in [email, email2]:
    print(re.findall(r"\b\d{7}\b", eml))
Output:
['8754321', '5245344']
['7651234', '9872341']
\b - word boundary
\d{7} - exactly 7 digits
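The regex answers above rely on the invoice number being exactly seven digits, while the question's edit says the length can vary. A possible alternative, sketched below, keys on the table layout instead: find the header line whose first word is 'Invoice' or 'Invoice#', then collect the first token of each following line for as long as that token is purely numeric. The function name extract_invoice_numbers is made up for illustration, and the sketch assumes the invoice column is the first column, as in both sample emails.

```python
def extract_invoice_numbers(text):
    """Collect first-column numbers from the rows under an 'Invoice' header line.

    Assumes the invoice number is the first token of every table row.
    """
    numbers = []
    in_table = False
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        if tokens[0].rstrip('#') == 'Invoice':
            in_table = True  # header row found; data rows follow
            continue
        if in_table:
            if tokens[0].isdigit():
                numbers.append(tokens[0])
            else:
                in_table = False  # first non-numeric row ends the table
    return numbers

email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''

email2 = '''Hi - please confirm the status of below two invoices.
Invoice# Amount Invoice Date Due Date
7651234 $19,579.06 29-Jan-19 28-Apr-19
9872341 $47,137.20 27-Feb-19 26-Apr-19'''

print(extract_invoice_numbers(email))   # ['8754321', '5245344']
print(extract_invoice_numbers(email2))  # ['7651234', '9872341']
```

This only works while the line breaks are preserved, so it has to run before the text is flattened into a single string.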

Python regex solution for extracting German address format

I'm trying to write a Python regex for extracting German addresses like the ones shown below.
Abc Gmbh Ensisheimer Straße 6-8 79346 Endingen
Def Gmbh Keltenstr . 16 77971 Kippenheim Deutschland
Ghi Deutschland Gmbh 53169 Bonn
Jkl Gmbh Ensisheimer Str . 6 -8 79346 Endingen
I wrote the code below to extract the individual address components, and also combined them into a single regex, but it still fails to detect the addresses above. Can anyone please help me with it?
import re
# TEST COMPANY NAME
string = 'Telekom Deutschland Gmbh 53169 Bonn Datum'
result = re.findall(r'([a-zA-Zäöüß]+\s*?[A-Za-zäöüß]+\s*?[A-Za-zäöüß]?)',string,re.MULTILINE)
print(result)
# TEST STREET NAME
result = re.findall(r'([a-zA-Zäöüß]+\s*\.)',string)
print(result)
# TEST STREET NUMBER
result = re.findall(r'(\d{1,3}\s*[a-zA-Z]?[+|-]?\s*[\d{1,3}]?)',string)
print(result)
# TEST POSTAL CODE
result = re.findall(r'(\d{5})',string)
print(result)
# TEST CITY NAME
result = re.findall(r'([A-Za-z]+)?',string)
print(result)
# TEST COMBINED ADDRESS COMPONENTS GROUP
result = re.findall(r'([a-zA-Zäöüß]+\s+?[A-Za-zäöüß]+\s+?[A-Za-zäöüß]+\s+([a-zA-Zäöüß]+\s*\.)+?\s+(\d{1,3}\s*[a-zA-Z]?[+|-]?\s*[\d{1,3}]?)+\s+(\d{5})+\s+([A-Za-z]+))',string)
print(result)
Please note that my objective is that if any of these addresses are present in a huge paragraph of text then the regex should extract and print only the addresses. Can someone please help me?

I would opt against a regex solution and use libpostal instead. It has bindings for several other languages (for Python, use postal). You will have to install libpostal separately, since it includes 1.8GB of training data.
The good thing is that you can give it address parts in any order and it will pick the right parts most of the time.
It uses machine learning, trained on OpenStreetMap data in many languages.
For the examples given, you would not necessarily need to cut the company name and country from the string first:
from postal.parser import parse_address
parse_address('Telekom Deutschland Gmbh 53169 Bonn Datum')
[('telekom deutschland gmbh', 'house'),
('53169', 'postcode'),
('bonn', 'city'),
('datum', 'house')]
parse_address('Keltenstr . 16 77971 Kippenheim')
[('keltenstr', 'road'),
('16', 'house_number'),
('77971', 'postcode'),
('kippenheim', 'city')]
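If installing libpostal is not an option, a much narrower fallback is to anchor on the one component with a fixed shape, the five-digit postal code, and take the word after it as the city. The sketch below does only that. It is an assumption-laden illustration, not a replacement for the parser: it does not recover company or street names, it treats any standalone five-digit number followed by a word as a postal code, and it keeps only the first word of a multi-word city. The sample text is made up from the question's examples.

```python
import re

# A standalone 5-digit number followed by one word taken to be the city.
# [^\W\d_] matches any Unicode letter, so umlauts and sharp s are covered.
POSTCODE_CITY = re.compile(r"\b(\d{5})\s+([^\W\d_]+)")

text = ("Invoice from Def Gmbh Keltenstr . 16 77971 Kippenheim Deutschland, "
        "copy to Ghi Deutschland Gmbh 53169 Bonn")

print(POSTCODE_CITY.findall(text))
# [('77971', 'Kippenheim'), ('53169', 'Bonn')]
```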

Python Regex to extract codes from a string

I have a string like this:
Srting = "$33.53 with 2 coupon codes : \r\n\r\n1) CODEONE\r\n\r\n2) CODETWO \r\n\r\nBoth coupons only work if you buy 1 by 1"
I want to extract the coupon codes "CODEONE" and "CODETWO" from this string when the following condition is true:
if "coupon code" in Srting:
Please help me extract these coupon codes. I need a generic regex, because in other strings the codes may appear in a different place, and there may be only one code.

This might help.
import re
Srting = "$33.53 with 2 coupon codes : \r\n\r\n1) CODEONE\r\n\r\n2) CODETWO \r\n\r\nBoth coupons only work if you buy 1 by 1"
for i in re.findall(r"\d+\)(.*)", Srting):
    print(i.strip())
Output:
CODEONE
CODETWO
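The same idea can be wrapped in a function that also applies the "coupon code" check from the question. This is a sketch under two assumptions that the question does not confirm: each code follows a numbered marker such as "1)", and codes consist only of uppercase letters and digits. The function name extract_coupon_codes is made up for illustration.

```python
import re

def extract_coupon_codes(text):
    """Return the codes that follow 'N)' markers, if the text mentions coupon codes."""
    if "coupon code" not in text.lower():
        return []
    # Assumption: a code is a run of uppercase letters and digits after "N)".
    return re.findall(r"\d+\)\s*([A-Z0-9]+)", text)

sample = ("$33.53 with 2 coupon codes : \r\n\r\n1) CODEONE\r\n\r\n2) CODETWO "
          "\r\n\r\nBoth coupons only work if you buy 1 by 1")

print(extract_coupon_codes(sample))                           # ['CODEONE', 'CODETWO']
print(extract_coupon_codes("Use coupon code:\r\n1) SAVE10"))  # ['SAVE10']
print(extract_coupon_codes("No offers today"))                # []
```

Capturing only [A-Z0-9]+ means no trailing whitespace or carriage return ends up in the result, so the strip() call is no longer needed.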

How do I get an input to equal a certain amount of characters?

Our teacher has told us to make a program that validates a postcode as a Northern Ireland postcode. That means it must have the letters "BT" in it and must equal 8 characters. In the code below I managed to get the letters part working. However, he did not go into detail on how to make the input equal 8 characters. He mentioned using .length() and validation(try and except), but I'm unsure how to use .length() to get 8 characters. Here's my code:
postcode = input("Please enter a Northern Ireland postcode:")
BT = "BT"
while BT not in postcode:
    postcode = input("That isn't a Northern Ireland postcode. Try again:")
print("This is a Northern Ireland postcode")

To get the number of characters in a string, use len(postcode). Python strings have no .length() method; len() is the built-in equivalent.
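Putting the two checks together might look like the sketch below. It follows the requirements as stated in the question (contains "BT" and is exactly 8 characters long), not the full postcode format. The names is_ni_postcode and ask_for_postcode are made up for illustration, and the read parameter exists only so the loop can be tried without typing at a prompt.

```python
def is_ni_postcode(postcode):
    """Check the two rules from the question: contains 'BT' and has 8 characters."""
    return "BT" in postcode and len(postcode) == 8

def ask_for_postcode(read=input):
    """Keep asking until the entered postcode passes both checks."""
    postcode = read("Please enter a Northern Ireland postcode: ")
    while not is_ni_postcode(postcode):
        postcode = read("That isn't a Northern Ireland postcode. Try again: ")
    return postcode

print(is_ni_postcode("BT12 3CD"))  # True
print(is_ni_postcode("BT1"))       # False
print(is_ni_postcode("AB12 3CD"))  # False
```

Calling ask_for_postcode() with no arguments uses input() and repeats the prompt until a valid postcode is entered.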
