Compare two strings and Extract value of variable data in Python - python

In my Python script, I have a list of strings like this:
birth_year = ["my birth year is *","i born in *","i was born in *"]
I want to compare an input sentence with the above list and get the birth year as output.
The input sentence is like:
Example1: My birth year is 1994.
Example2: I born in 1995
The output will be:
Example1: 1994
Example2: 1995
I tried many approaches using regex, but I didn't find a solution that works.

If you change birth_year to a list of regexes you could match more easily with your input string. Use a capturing group for the year.
Here's a function that does what you want:
import re
def match_year(birth_year, text):
    for s in birth_year:
        m = re.search(s, text, re.IGNORECASE)
        if m:
            # keep whatever precedes the match (e.g. "Example1: ") and append the captured year
            output = f'{text[:m.start(0)]}{m[1]}'
            print(output)
            break
Example:
birth_year = [r"my birth year is (\d{4})", r"i born in (\d{4})", r"i was born in (\d{4})"]
match_year(birth_year, "Example1: My birth year is 1994.")
match_year(birth_year, "Example2: I born in 1995")
Output:
Example1: 1994
Example2: 1995
You need at least Python 3.6 for f-strings.
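If you would rather keep the original list with its * placeholders, the templates can be converted to regexes on the fly. The following is only a minimal sketch under the assumption that * always stands for a 4-digit year; the helper name extract_year is hypothetical and not part of the answer above.

```python
import re

def extract_year(templates, sentence):
    """Return the 4-digit year matched by the first fitting template, or None."""
    for template in templates:
        # Escape the literal text, then turn the escaped "*" into a capture group.
        # Assumption: "*" always stands for a 4-digit year.
        pattern = re.escape(template).replace(r"\*", r"(\d{4})")
        match = re.search(pattern, sentence, re.IGNORECASE)
        if match:
            return match.group(1)
    return None

birth_year = ["my birth year is *", "i born in *", "i was born in *"]
print(extract_year(birth_year, "My birth year is 1994."))  # 1994
print(extract_year(birth_year, "I born in 1995"))          # 1995
```

Returning the value instead of printing it makes the function easier to reuse, and a None result signals that no template matched.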

You can try something like this, replacing the unnecessary part of the string with an empty string:
str1 = "My birth year is 1994."
str2 = str1.replace("My birth year is ", "")  # str2 is now "1994."
For the code you shared, you can do something like:
for x in examples:
    for y in birth_year:
        if y in x:  # check whether the substring exists in the example
            print(x.replace(y, ""))  # if it does, strip it and print what remains
This should work as long as the entries in birth_year are plain prefixes (no trailing *) whose case matches the input.

If you can guarantee that those strings always contain exactly one 4-digit number, which is the year of birth, I'd say just use a regex to grab whatever 4 digits are surrounded by non-digits. Rather dumb, but hey, it works with your data.
import re
examples = ["My birth year is 1993.", "I born in 1995", "я родился в 1976м году"]
for s in examples:
    y = int(re.findall(r"^[^\d]*([\d]{4})[^\d]*$", s)[0])
    print(y)

Related

Given a string, extract all the necessary information about the person

In my homework, I need to extract the first name, last name, ID code, phone number, date of birth and address of a person from a given string using Regex. The order of the parameters always remains the same. Each parameter requires a separate pattern.
Requirements are as follows:
Both first and last names always begin with a capital letter followed by at least one lowercase letter.
ID code is always 11 characters long and consists only of numbers.
The phone number itself is a combination of 7-8 numbers. The phone number might be separated from the area code with a whitespace, but not necessarily. It is also possible that there is no area code at all.
Date of birth is formatted as dd-MM-YYYY
Address is everything else that remains.
I got the following patterns for each parameter:
str1 = "HeinoPlekk69712047623+3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti"
first_name_pattern = r"^[A-Z][a-z]+"
last_name_pattern = r"[A-z][a-z]+(?=[0-9])"
id_code_pattern = r"\d{11}(?=\+)"
phone_number_pattern = r"\+\d{3}?\s*\d{7,8}"
date_pattern = r"\d{1,2}\-\d{1,2}\-\d{1,4}"
address_pattern = r"[A-Z][a-z]*\s.*$"
first_name_match = re.findall(first_name_pattern, str1)
last_name_match = re.findall(last_name_pattern, str1)
id_code_match = re.findall(id_code_pattern, str1)
phone_number_match = re.findall(phone_number_pattern, str1)
date_match = re.findall(date_pattern, str1)
address_match = re.findall(address_pattern, str1)
So, given "HeinoPlekk69712047623+3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti", I get ['Heino'] ['Plekk'] ['69712047623'] ['+37256887364'] ['12-09-2020'] ['Tartu mnt 183,Tallinn,16881,Eesti'], which suits me perfectly.
The problem starts when the area code is missing, because now id_code_pattern can't find the id code because of (?=\+), and if one tries to use |\d{11} (or) there is another problem because now it finds both id code and phone number (69712047623 and 37256887364). And how to improve phone_number_pattern so that it finds only 7 or 8 digits of the phone number, I do not understand.
A single expression with some well-crafted capture groups will help you immensely:
import re
str1 = "HeinoPlekk69712047623+3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti"
pattern = r"^(?P<first_name>[A-Z][a-z]+)(?P<last_name>[A-Z][a-z]+)(?P<id_code>\d{11})(?P<phone>(?:\+\d{3})?\s*\d{7,8})(?P<dob>\d{1,2}\-\d{1,2}\-\d{1,4})(?P<address>.*)$"
print(re.match(pattern, str1).groupdict())
Result:
{'first_name': 'Heino', 'last_name': 'Plekk', 'id_code': '69712047623', 'phone': '+37256887364', 'dob': '12-09-2020', 'address': 'Tartu mnt 183,Tallinn,16881,Eesti'}
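To see how a single expression copes with a missing area code, here is a hedged variation of the pattern. It is a sketch, not the answer's exact pattern: the area code gets its own optional group (the name area_code is an assumption), and the date uses fixed widths on the assumption that it is always dd-MM-YYYY. The fixed widths matter because a looser \d{1,2} would let an 8-digit phone match swallow the first digit of the day when the real phone number has only 7 digits.

```python
import re

# Assumptions: the date is always dd-MM-YYYY, and the area code is "+" plus 3 digits.
PERSON = re.compile(
    r"^(?P<first_name>[A-Z][a-z]+)"
    r"(?P<last_name>[A-Z][a-z]+)"
    r"(?P<id_code>\d{11})"
    r"(?P<area_code>\+\d{3})?\s*"   # optional, captured separately
    r"(?P<phone>\d{7,8})"
    r"(?P<dob>\d{2}-\d{2}-\d{4})"
    r"(?P<address>.*)$"
)

with_area = "HeinoPlekk69712047623+3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti"
without_area = "HeinoPlekk697120476235688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti"

results = [PERSON.match(line).groupdict() for line in (with_area, without_area)]
for result in results:
    print(result)
```

In the second result area_code is None, while id_code and phone are still split correctly because the 11-digit ID is consumed before the phone group starts.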

How to find the words that correspond to months and replace them with numbers?

How can I find the words that correspond to the months "January, February, March, etc." and replace them with the numeric values "01, 02, 03, etc."?
I tried the code below
def transformMonths(string):
    rep = [("May", "05"), ("June", "06")]
    for pat, repl in rep:
        s = re.sub(pat, repl, string)
    return s
print( transformMonths('I was born on June 24 and my sister was born on May 17') )
My code provides this result ('I was born on 06 24 and my sister was born on May 17')
However, I want the output to be like this ('I was born on 06 24 and my sister was born on 05 17')
You are performing the replacement on the initial (unmodified) string at each iteration so you end up with only one month name being replaced. You can fix that by assigning string instead of s in the loop (and return string at the end).
Note that your approach does not require a regular expression and could use a simple string replace: string = string.replace(pat,repl).
In both cases, because the replacement does not take into account word boundaries, the function would replace partial words such as:
"Mayor Smith was elected on May 25" --> "05or Smith was elected on 05 25".
You can fix that in your regular expression by adding \b before and after each month name. This will ensure that the month names are only found if they are between word boundaries.
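The two fixes described so far (reassigning the same string in the loop and adding \b around each month name) can be sketched as follows. The function name transform_months is just a renamed, hypothetical variant of the function in the question.

```python
import re

def transform_months(text):
    replacements = [("May", "05"), ("June", "06")]
    for name, number in replacements:
        # Reassign text so every replacement builds on the previous one,
        # and use \b so only whole words are replaced.
        text = re.sub(r"\b" + name + r"\b", number, text)
    return text

print(transform_months("I was born on June 24 and my sister was born on May 17"))
# I was born on 06 24 and my sister was born on 05 17
print(transform_months("Mayor Smith was elected on May 25"))
# Mayor Smith was elected on 05 25
```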
The re.sub can perform multiple replacements with varying values if you give it a function instead of a fixed string. So you can build a combined regular expression that will find all the months and replace the words that are found using a dictionary:
import re
def numericMonths(string):
    months = {"January":"01", "February":"02","March":"03", "April":"04",
              "May":"05", "June":"06", "July":"07", "August":"08",
              "September":"09","October":"10", "November":"11","December":"12"}
    pattern = r"\b("+"|".join(months)+r")\b" # all months as distinct words
    return re.sub(pattern,lambda m:months[m.group()],string)
output:
numericMonths('I was born on June 24 and my sister was born on May 17')
'I was born on 06 24 and my sister was born on 05 17'

How to write a regex in python to recognize days inside a string

In this assignment, the input wanted is in this format:
Regular: 16Mar2009(mon), 17Mar2009(tues), 18Mar2009(wed) ...
Reward: 26Mar2009(thur), 27Mar2009(fri), 28Mar2009(sat)
Regular or Reward is the name of customer type. I separated this string like this.
entry_list = input.split(":") #input is a variable
client = entry_list[0] # only Regular or Reward
dates = entry_list[1] # only dates
days = dates.split(",")
But now I need to count weekdays or weekend days inside the days list:
days = [' 16Mar2009(mon)', ' 17Mar2009(tues)', ' 18Mar2009(wed)']
When it is mon tues wed thur fri, all count as weekday, and I need to know how many weekdays the input have.
When it is sat sun must be counted as weekend days, and I need to know how many weekends the input have.
How to write a regex in python to search for all weekdays and weekend days inside this list and count them, putting the number of weekdays and weekend days in two different counters?
EDIT
I wrote this function to check whether the dates are in the right format, but it's not working:
def is_date_valid(date):
    date_regex = re.compile(r'(?:\d{1,2}[A-Za-z]{3}\d{4}\([A-Za-z]{3}\),\s+){2}\d{1,2}[A-Za-z]{3}\d{4}\([A-Za-z]{3}\)$')
    m = date_regex.search(date)
m is always None.
You don't really need a regex for this. You can just look for "sat" and "sun" tags directly, since your days are formatted the same way (i.e. no capitals, no "tue" instead of "tues", etc.) you shouldn't need to generalize to a pattern. Just loop through the list and look for "sat" and "sun":
import re  # only needed for the re.search variant
days = [' 16Mar2009(mon)', ' 17Mar2009(tues)', ' 18Mar2009(wed)', ' 18Mar2009(sat)', ' 18Mar2009(sun)']
weekends = 0
weekdays = 0
for day in days:
    if "sat" in day or "sun" in day:  # if re.search('(sat|sun)', day): also works
        weekends = weekends+1
    else:
        weekdays = weekdays+1
print(weekends)
print(weekdays)
>>>2
>>>3
If you need to use a regex, for example because this is part of an assignment, then this variation of the if statement will do it: if re.search('(sat|sun)', day): This isn't much more useful than the plain string checks, since the strings are the regex in this case, but seeing how to combine multiple patterns into one regex with or-style logic is useful, so I'm still including it here.
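If the assignment does call for a regex, one possible sketch is to capture the day tag of every date in the raw input line and count from there. The sample entry and the variable names are assumptions made for illustration. Note that the day tag is matched with {3,4} letters: tags such as "tues" and "thur" have four letters, which a {3}-only pattern like the one in the question's EDIT would reject.

```python
import re

# Hypothetical input line in the format described in the question
entry = "Regular: 16Mar2009(mon), 17Mar2009(tues), 18Mar2009(wed), 21Mar2009(sat)"

# Capture only the day tag inside the parentheses of each date
day_tags = re.findall(r"\d{1,2}[A-Za-z]{3}\d{4}\(([A-Za-z]{3,4})\)", entry)

weekends = sum(1 for tag in day_tags if tag in ("sat", "sun"))
weekdays = len(day_tags) - weekends

print(day_tags)            # ['mon', 'tues', 'wed', 'sat']
print(weekdays, weekends)  # 3 1
```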

Python, regex to exclude matches of numbers

To use regex to extract any numbers of length greater than 2, in a string, but also exclude "2016", here is what I have:
import re
string = "Employee ID DF856, Year 2016, Department Finance, Team 2, Location 112 "
print re.findall(r'\d{3,}', string)
output:
['856', '2016', '112']
I tried to change it to below to exclude "2016" but all failed.
print re.findall(r'\d{3,}/^(!2016)/', string)
print re.findall(r"\d{3,}/?!2016/", string)
print re.findall(r"\d{3,}!'2016'", string)
What is the right way to do it? Thank you.
The question was extended; please see the final comment made by Wiktor Stribiżew for the update.
You may use
import re
s = "Employee ID DF856, Year 2016, Department Finance, Team 2, Location 112 20161 12016 120162"
print(re.findall(r'(?<!\d)(?!2016(?!\d))\d{3,}', s))
Details
(?<!\d) - no digit is allowed immediately to the left of the current location
(?!2016(?!\d)) - the text immediately to the right of the current location must not be 2016, unless that 2016 is followed by another digit
\d{3,} - 3 or more digits.
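To make the role of each lookaround concrete, the short sketch below runs the full pattern and two deliberately weakened variants. The variable names are made up for illustration.

```python
import re

s = "Employee ID DF856, Year 2016, Department Finance, Team 2, Location 112 20161 12016 120162"

# Full pattern: only the standalone 2016 is excluded
full = re.findall(r'(?<!\d)(?!2016(?!\d))\d{3,}', s)
print(full)  # ['856', '112', '20161', '12016', '120162']

# Without the lookbehind, the match can start in the middle of 2016
no_lookbehind = re.findall(r'(?!2016(?!\d))\d{3,}', "Year 2016")
print(no_lookbehind)  # ['016']

# Without the inner (?!\d), any number that starts with 2016 is rejected too
no_inner = re.findall(r'(?<!\d)(?!2016)\d{3,}', "20161")
print(no_inner)  # []
```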
An alternative solution with some code:
import re
s = "Employee ID DF856, Year 2016, Department Finance, Team 2, Location 112 20161 12016 120162"
print([x for x in re.findall(r'\d{3,}', s) if x != "2016"])
Here, we extract any chunks of 3 or more digits (re.findall(r'\d{3,}', s)) and then filter out those equal to 2016.
You want to use a negative lookahead. The correct syntax is:
\D(?!2016)(\d{3,})\b
Results in:
In [24]: re.findall(r'\D(?!2016)(\d{3,})\b', string)
Out[24]: ['856', '112']
Or using a negative lookbehind:
In [26]: re.findall(r'\D(\d{3,})(?<!2016)\b', string)
Out[26]: ['856', '112']
Another way to do this can be:
st="Employee ID DF856, Year 2016, Department Finance, Team 2, Location 112 "
re.findall(r"\d{3,}",re.sub("((2)?(016))","",st))
output will be:
['856', '112']
However, the accepted answer looks like a faster method than my suggestion.

Re - Sub all numbers except $ and percentages

I'm trying to build a regular expression in Python to sub numbers that aren't dollar values or percentages with x's. Here is an example sentence:
s = "Hi there my name is Jon Doe, I haven't been here for 4 years, my birthday is 1/23/92, I received 10% off of my $20.50 purchase."
re.sub(<pattern>, 'x', s)
I would like the output to be:
Hi there my name is Jon Doe, I haven't been here for x years, my birthday is x/xx/xx, I received 10% off of my $20.50 purchase.
Thanks!
As @abarnert said, this is at least an attempt that can serve as a starting point for discussion.
re.sub(r'(?<![$.0-9])\d*[.]*\d+(?![%.0-9])', 'x', s)
It searches for runs of digits that can have periods in the middle or at the beginning (\d*[.]*\d+), surrounded by negative lookarounds that rule out digits, periods and dollar signs before the match and percent signs, periods and digits after it ((?<![$.0-9]) and (?![%.0-9])).
Output:
"Hi there my name is Jon Doe, I haven't been here for x years, my birthday is x/x/x, I received 10% off of my $20.50 purchase."

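The pattern above replaces each matched number with a single x, which gives x/x/x rather than the x/xx/xx asked for in the question. One possible way to keep the number of digits, sketched here as an assumption rather than as part of the answer, is to pass a function to re.sub that masks every digit inside the match:

```python
import re

s = "Hi there my name is Jon Doe, I haven't been here for 4 years, my birthday is 1/23/92, I received 10% off of my $20.50 purchase."
pattern = r'(?<![$.0-9])\d*[.]*\d+(?![%.0-9])'

# Replace each digit inside a match with "x"; other characters are left alone
masked = re.sub(pattern, lambda m: re.sub(r'\d', 'x', m.group()), s)
print(masked)
# Hi there my name is Jon Doe, I haven't been here for x years, my birthday is x/xx/xx, I received 10% off of my $20.50 purchase.
```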