Given a string, extract all the necessary information about the person - python

In my homework, I need to extract the first name, last name, ID code, phone number, date of birth and address of a person from a given string using Regex. The order of the parameters always remains the same. Each parameter requires a separate pattern.
Requirements are as follows:
Both first and last names always begin with a capital letter followed by at least one lowercase letter.
ID code is always 11 characters long and consists only of numbers.
The phone number itself is a combination of 7-8 numbers. The phone number might be separated from the area code with a whitespace, but not necessarily. It is also possible that there is no area code at all.
Date of birth is formatted as dd-MM-YYYY
Address is everything else that remains.
I got the following patterns for each parameter:
import re
str1 = "HeinoPlekk69712047623+3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti"
first_name_pattern = r"^[A-Z][a-z]+"
last_name_pattern = r"[A-z][a-z]+(?=[0-9])"
id_code_pattern = r"\d{11}(?=\+)"
phone_number_pattern = r"\+\d{3}?\s*\d{7,8}"
date_pattern = r"\d{1,2}\-\d{1,2}\-\d{1,4}"
address_pattern = r"[A-Z][a-z]*\s.*$"
first_name_match = re.findall(first_name_pattern, str1)
last_name_match = re.findall(last_name_pattern, str1)
id_code_match = re.findall(id_code_pattern, str1)
phone_number_match = re.findall(phone_number_pattern, str1)
date_match = re.findall(date_pattern, str1)
address_match = re.findall(address_pattern, str1)
So, given "HeinoPlekk69712047623+3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti", I get ['Heino'] ['Plekk'] ['69712047623'] ['+37256887364'] ['12-09-2020'] ['Tartu mnt 183,Tallinn,16881,Eesti'], which suits me perfectly.
The problem starts when the area code is missing: id_code_pattern no longer finds the ID code because of the (?=\+) lookahead. If I add |\d{11} as an alternative, there is another problem, because now it finds both the ID code and the phone number (69712047623 and 37256887364). I also do not understand how to improve phone_number_pattern so that it finds only the 7 or 8 digits of the phone number.

A single expression with some well-crafted capture groups will help you immensely:
import re
str1 = "HeinoPlekk69712047623+3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti"
pattern = r"^(?P<first_name>[A-Z][a-z]+)(?P<last_name>[A-Z][a-z]+)(?P<id_code>\d{11})(?P<phone>(?:\+\d{3})?\s*\d{7,8})(?P<dob>\d{1,2}\-\d{1,2}\-\d{1,4})(?P<address>.*)$"
print(re.match(pattern, str1).groupdict())
Result:
{'first_name': 'Heino', 'last_name': 'Plekk', 'id_code': '69712047623', 'phone': '+37256887364', 'dob': '12-09-2020', 'address': 'Tartu mnt 183,Tallinn,16881,Eesti'}
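The pattern above already copes with a missing area code, because the (?:\+\d{3})? group is optional. Here is a slightly stricter sketch, assuming the date really is always dd-MM-YYYY as the question states. It pins the date to \d{2}-\d{2}-\d{4}, so a 7-digit phone number cannot swallow the first digit of the day, and it captures the area code in its own group so that phone holds only the 7 or 8 digits. The names parse_person, PERSON_PATTERN and area_code are my own for illustration, not part of the original answer.

```python
import re

# Assumption: fields always appear in this order with no separators,
# and the date is strictly dd-MM-YYYY.
PERSON_PATTERN = re.compile(
    r"^(?P<first_name>[A-Z][a-z]+)"
    r"(?P<last_name>[A-Z][a-z]+)"
    r"(?P<id_code>\d{11})"
    r"(?P<area_code>\+\d{3})?\s*"   # optional area code, optional whitespace
    r"(?P<phone>\d{7,8})"           # only the 7 or 8 digits of the number
    r"(?P<dob>\d{2}-\d{2}-\d{4})"   # strict dd-MM-YYYY
    r"(?P<address>.*)$"
)


def parse_person(text):
    """Return a dict of the named groups, or None if the text does not match."""
    match = PERSON_PATTERN.match(text)
    return match.groupdict() if match else None


with_code = parse_person(
    "HeinoPlekk69712047623+3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti"
)
without_code = parse_person(
    "HeinoPlekk697120476235688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti"
)
print(with_code)
print(without_code)
```

When the area code is absent, the area_code entry is None and the other fields are unchanged.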


new to python (as in 1 week in) and need help getting pointed in the right direction

Need to write a code for a school lab.
Input is First name Middle name Last Name
Output needs to be Last name, First initial. Middle Initial.
It must also work with just first and last name.
Examples:
Input: Jane Ann Doe
Output: Doe, J. A.
Input: Jane Doe
Output: Doe, J.
Code thus far is:
# 2.12 Lab, input First name Middle name last name
# result to print Last name, first initial. Middle initial period.
# result must account for user not having middle name
name = input()
tokens = name.split()
I do not understand how to write an if statement followed by print statement to get the desired output.
name = input("Enter name: ")
tokens = name.split()
# More than two words means a middle name was entered.
if len(tokens) > 2:
    print(tokens[-1] + ",", tokens[0][0]+".", tokens[1][0]+".")
else:
    print(tokens[-1] + ",", tokens[0][0]+".")
With what you have so far, tokens will be a list of the words you entered, such as ['Jane', 'Ann', 'Doe'].
What you need to do is to print out the last of those items in full, followed by a comma. Then each of the other items in order but with just the first letter followed by a period.
You can get the last item of a list x with x[-1]. You can get each of the others with a loop like:
for item in x[:-1]:
    doSomethingWith(item)
And the first character of the string item can be extracted with item[0].
That should hopefully be enough to get you on your way.
If it's not enough, read on, though it would be far better for you if you tried to nut it out yourself first.
...
No? Okay then, here we go ...
The following code shows one way you can do this, with hopefully enough comments that you will understand:
import sys
# Get line and turn into list of words.
inputLine = input("Please enter your full name: ")
tokens = inputLine.split()
print(tokens)
# Pre-check to make sure at least two words were entered.
if len(tokens) < 2:
    print("ERROR: Need at least two tokens in the name.")
    sys.exit(0)
# Print last word followed by comma, no newline (using f-strings).
print(f"{tokens[-1]},", end="")
# Process all but the last word.
for namePart in tokens[:-1]:
    # Print first character of word followed by period, no newline.
    print(f" {namePart[0]}.", end="")
# Make sure line is terminated by a newline character.
print()
You could no doubt make that more robust against weird edge cases like a first name of "." but it should be okay for an educational assignment.
But it handles even more complex names such as "River Rocket Blue Dallas Oliver" (yes, I'm serious, that's a real name).
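If you would rather build the string than print it piece by piece, the same idea can be wrapped in a function. This is only a sketch; the name format_name is made up for illustration, and it assumes every word in the name is non-empty (which str.split() with no arguments guarantees).

```python
def format_name(full_name):
    """Return 'Last, F. M.' for a name made of two or more words."""
    tokens = full_name.split()
    if len(tokens) < 2:
        raise ValueError("need at least a first and a last name")
    # One initial plus a period for every word except the last.
    initials = " ".join(f"{part[0]}." for part in tokens[:-1])
    return f"{tokens[-1]}, {initials}"


print(format_name("Jane Ann Doe"))
print(format_name("Jane Doe"))
```

Returning the string instead of printing it makes the two required examples easy to check, and any number of middle names works without extra branches.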
# 2.12 Lab, input First name Middle name last name
# result to print Last name, first initial. Middle initial period.
# result must account for user not having middle name
name = input()
tokens = name.split()
if len(tokens) == 2:  # only two names entered
    last_name = tokens[1]
    first_init = tokens[0][0]
    print(last_name, ', ', first_init, '.', sep='')
if len(tokens) == 3:  # three names entered
    last_name = tokens[2]
    first_init = tokens[0][0]
    middle_init = tokens[1][0]
    print(last_name, ', ', first_init, '. ', middle_init, '.', sep='')
Try this code:
a=input()
name=a.split(" ")
index=len(name)
if index==3:
    print(f"{name[-1]}, {name[-3][0]}. {name[-2][0]}.")
else:
    print(f"{name[-1]}, {name[-2][0]}.")
Here is the explanation of the code:
First, using input(), we get the name of the person.
Then we split the name using .split(), passing " " as the separator (the argument in the parentheses).
Next, we find the number of elements in the list (.split() returns a list) for the if statement.
Finally, we print the output from the matching branch of the if statement, using indexing to extract the first letter of each name.

Regex to detect a phone number

I need to find a phone number in a given paragraph text, with the conditions as below.
The word Phone/Ph/tel/telephone should exist in the sentence where the phone number is present.
For ex: (consider the below paragraph.)
This is my Phone number and I am 25 years old, 999-888-7894 and I am looking for a regex script.
As you can see this paragraph has a phone number signified, and it has the word "Phone" in the sentence (31 characters before the phone number).
So i would like to detect this as a phone number if and only if it has the words Phone/Ph/tel/telephone 50 characters before or after the phone number.
I tried using lookaround in regex but did not work.
import re
phno = re.compile(r'(?<=Ph\s)(?<=Phone\s)(?<=tel\s)telephone(?<=telephone\s)\b([0-9]{3}[-][0-9]{3}[-][0-9]{4})\b',re.MULTILINE)
data = "This is my phone number and I am 25 years old, 999-888-7894 and I am looking for a regex script."
l = phno.findall(data)
print(l)
I am getting output empty list [ ] because the word 'Phone' is not detected by regex (I need it to detect 50 chars before or after phone number)
import re
data = """This is my phone number and I am 25 years old, 999-888-7894 and I am looking for a regex script.
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 999-123-4567 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
And 555-555-1212 is my telephone."""
phno = re.compile(r'\b(?:phone|ph|tel|telephone)\b.{0,49}\b(\d{3}[-]\d{3}[-]\d{4})\b|\b(\d{3}[-]\d{3}[-]\d{4})\b.{0,49}\b(?:phone|ph|tel|telephone)\b', flags=re.I)
phones = [m.group(1) if m.group(1) else m.group(2) for m in phno.finditer(data)]
print(phones)
Prints:
['999-888-7894', '555-555-1212']
Assuming you only want to detect hyphen-separated US phone numbers containing area codes, you could use the following regex pattern with re.findall:
\b\d{3}-\d{3}-\d{4}\b
Script:
import re
sentence = "This is my Phone number and I am 25 years old, 999-888-7894 and I am looking for a regex script."
numbers = re.findall(r'\b\d{3}-\d{3}-\d{4}\b', sentence)
print(numbers)
This prints:
['999-888-7894']
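The "keyword within 50 characters before or after" condition from the question can also be checked in two steps instead of one large regex: find every candidate number, then measure the gap to each keyword. This is a rough sketch, not a definitive implementation. The function name find_phone_numbers and the window parameter are my own, the distance is measured between the end of one match and the start of the other, and unlike a pattern built on . the check here also reaches across line breaks.

```python
import re

NUMBER = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
KEYWORD = re.compile(r"\b(?:phone|ph|tel|telephone)\b", re.IGNORECASE)


def find_phone_numbers(text, window=50):
    """Return numbers that have a keyword at most `window` characters away."""
    keyword_spans = [k.span() for k in KEYWORD.finditer(text)]
    found = []
    for number in NUMBER.finditer(text):
        for k_start, k_end in keyword_spans:
            gap_before = number.start() - k_end   # keyword precedes the number
            gap_after = k_start - number.end()    # keyword follows the number
            if 0 <= gap_before <= window or 0 <= gap_after <= window:
                found.append(number.group())
                break
    return found


sentence = ("This is my Phone number and I am 25 years old, "
            "999-888-7894 and I am looking for a regex script.")
print(find_phone_numbers(sentence))
```

Keeping the number pattern and the keyword list separate makes it easy to change the window size or add keywords later.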

How do I grab specific text in between other text?

I need help grabbing just K334-76A9 from this string:
b'\x0cWelcome, Pepo \r\nToday is Mon 04/29/2019 \r\n\r\n Volume in drive C has no label.\r\n Volume Serial Number is K334-76A9\r\n
Please help, I have tried so many things but none have worked.
Sorry if my question is bad :/
If you want to find the format xxxx-xxxx, no matter what string you have you can do it like this:
import re
b = '\x0cWelcome, Pepo \r\nToday is Mon 04/29/2019 \r\n\r\n Volume in drive C has no label.\r\n Volume Serial Number is K334-76A9\r\n'
splitString = b.split()
r = re.compile('.{4}-.{4}')
for string in splitString:
    if r.match(string):
        print(string)
Output:
K334-76A9
Here's code that grabs everything after "Serial Number is " up to the next whitespace character.
import re
data = b'\x0cWelcome, Pepo \r\nToday is Mon 04/29/2019 \r\n\r\n Volume in drive C has no label.\r\n Volume Serial Number is K334-76A9\r\n'
pat = re.compile(r"Serial Number is ([^\s]+)")
match = pat.search(data.decode("ASCII"))
if match:
    print(match.group(1))
Result:
K334-76A9
You can adjust the regular expression per your needs. Regular expressions are Da Bomb! This one's really simple, but you can do amazingly complex things with them.
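As one example of such an adjustment, here is a sketch that searches the bytes directly (so no decoding step is needed first) and only accepts a serial made of two groups of four characters. The character class [0-9A-Z] is an assumption based on the single sample in the question, and the names raw, SERIAL and serial are just for illustration.

```python
import re

raw = (b"\x0cWelcome, Pepo \r\nToday is Mon 04/29/2019 \r\n\r\n"
       b" Volume in drive C has no label.\r\n"
       b" Volume Serial Number is K334-76A9\r\n")

# A bytes pattern (rb"...") can be applied to bytes input as-is.
# Assumption: the serial is XXXX-XXXX with digits and uppercase letters only.
SERIAL = re.compile(rb"Serial Number is ([0-9A-Z]{4}-[0-9A-Z]{4})")

match = SERIAL.search(raw)
serial = match.group(1).decode("ascii") if match else None
print(serial)
```

If the real output can contain lowercase letters or a different layout, the character class is the part to loosen.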

Need to search a string for a "two word" pattern in python

I’m trying to search a long string of characters for a country name. The country name is sometimes more than one word, such as Costa Rica.
Here is my code:
eol = len(CountryList)
for c in range(0, eol):
    country = str(CountryList[c])
    countrymatch = re.search(country, fullsampledata)
    if countrymatch:
        ...
fullsampledata is a long string with all the data in one line. I’m trying to parse out the country by cycling thru a list of valid country names. If country is only one word, such as ‘Holland’, it finds it. However, if country is two or more words, ‘Costa Rica’, it doesn’t find it. Why?
You can search for a substring in a string using the .find() function as follows
fullsampledata = "hwfekfwekjfnkwfehCosta Ricakwjfkwfekfekfw"
fullsampledata.find("Morocco")
-1
fullsampledata.index("Costa Rica")
17
So you can make your if statement as follows
fullsampledata = "hwfekfwekjfnkwfehCosta Ricakwjfkwfekfekfw"
country = "Costa Rica"
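# Use .find() here: it returns -1 when the substring is missing,
# whereas .index() raises ValueError instead.
if fullsampledata.find(country) != -1:
    # Found
    pass
else:
    # Not Found
    pass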
In [1]: long_string = 'asdfsadfCosta Ricaasdkj asdfsd asdjas USA alsj'
In [2]: 'Costa Rica' in long_string
Out[2]: True
You don't have your code properly shown and I'm a little too lazy to parse it. Hope this helps.
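Since the question does not show the data, the actual cause is unknown. One guess (an assumption, not something confirmed by the question) is that the two words are separated by something other than a single ordinary space, such as a double space, a tab, a newline or a non-breaking space. The sketch below shows a plain substring check next to a regex version that escapes each word and allows any run of whitespace between them. The function names are hypothetical.

```python
import re


def find_countries(text, country_list):
    """Return the countries that occur literally in text."""
    return [country for country in country_list if country in text]


def find_countries_regex(text, country_list):
    """Same idea with re: each word is treated as literal text and any
    run of whitespace is accepted between the words of a name."""
    found = []
    for country in country_list:
        words = [re.escape(word) for word in country.split()]
        if re.search(r"\s+".join(words), text):
            found.append(country)
    return found


sample = "xx Holland yy Costa  Rica zz"   # note the double space
countries = ["Holland", "Costa Rica", "Morocco"]
print(find_countries(sample, countries))
print(find_countries_regex(sample, countries))
```

Printing repr() of the part of fullsampledata around the country name is a quick way to see which characters are really there.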

Add letters to string conditionally

Input: 1 10 avenue
Desired Output: 1 10th avenue
As you can see above I have given an example of an input, as well as the desired output that I would like. Essentially I need to look for instances where there is a number followed by a certain pattern (avenue, street, etc). I have a list which contains all of the patterns and it's called patterns.
If that number does not have "th" after it, I would like to add "th". Simply adding "th" is fine, because other portions of my code will correct it to either "st", "nd", "rd" if necessary.
Examples:
1 10th avenue OK
1 10 avenue NOT OK, TH SHOULD BE ADDED!
I have implemented a working solution, which is this:
def Add_Th(address):
    try:
        address = address.split(' ')
    except AttributeError:
        pass
    for pattern in patterns:
        try:
            location = address.index(pattern) - 1
            number_location = address[location]
        except (ValueError, IndexError):
            continue
        if 'th' not in number_location:
            new = number_location + 'th'
            address[location] = new
    address = ' '.join(address)
    return address
I would like to convert this implementation to regex, as this solution seems a bit messy to me, and occasionally causes some issues. I am not the best with regex, so if anyone could steer me in the right direction that would be greatly appreciated!
Here is my current attempt at the regex implementation:
def add_th(address):
    find_num = re.compile(r'(?P<number>[\d]{1,2}(' + "|".join(patterns + ')(?P<following>.*)')
    check_th = find_num.search(address)
    if check_th is not None:
        if re.match(r'(th)', check_th.group('following')):
            return address
        else:
            # this is where I would add th. I know I should use re.sub, i'm just not too sure
            # how I would do it
    else:
        return address
I do not have a lot of experience with regex, so please let me know if any of the work I've done is incorrect, as well as what would be the best way to add "th" to the appropriate spot.
Thanks.
Just one way, finding the positions behind a digit and ahead of one of those pattern words and placing 'th' into them:
>>> address = '1 10 avenue 3 33 street'
>>> patterns = ['avenue', 'street']
>>>
>>> import re
>>> pattern = re.compile(r'(?<=\d)(?= ({}))'.format('|'.join(patterns)))
>>> pattern.sub('th', address)
'1 10th avenue 3 33th street'
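The same lookaround idea can be wrapped into a function shaped like the one in the question. This is a sketch under a few assumptions of mine: patterns is passed in as an argument rather than read from a global, each pattern word is escaped with re.escape in case it contains regex metacharacters, and a trailing \b keeps "avenue" from matching inside "avenues".

```python
import re


def add_th(address, patterns):
    """Append 'th' to any bare number directly followed by a pattern word."""
    words = "|".join(re.escape(p) for p in patterns)
    # (?<=\d)              position right after a digit
    # (?=\s+(?:...)\b)     and right before whitespace plus a whole pattern word
    regex = re.compile(rf"(?<=\d)(?=\s+(?:{words})\b)")
    return regex.sub("th", address)


patterns = ["avenue", "street"]
print(add_th("1 10 avenue", patterns))
print(add_th("1 10th avenue", patterns))
```

A number that already carries a suffix is left alone, because the position before the space is then preceded by a letter and the lookbehind for a digit fails.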
