Replacing numbers in various formats with a word - python

I have a long sentence and I want to replace all numbers with a particular word. The numbers come in different formats, e.g.,
36
010616
010516 - 300417
01-04
2011 12
Is there a function in Python for replacing these kinds of occurrences with a word (say, "integer"), or what would the regex look like?
Example:
"This is a 10 sentence with date 010616 and intervals 06-08 200-209 01 - 09 in years 2012 26"
should become
"This is a NUMBER sentence with date NUMBER and intervals NUMBER NUMBER NUMBER in years NUMBER NUMBER"

Using Regex.
import re
s = "This is a 10 sentence with date 010616 and intervals 06-08 200-209 01 - 09 in years 2012 26"
print(re.sub(r"\d+", "NUMBER", s))
Output:
This is a NUMBER sentence with date NUMBER and intervals NUMBER-NUMBER NUMBER-NUMBER NUMBER - NUMBER in years NUMBER NUMBER

re.sub('((?<=^)|(?<= ))[0-9- ]+(?=$| )', 'NUMBER', s)
'This is a NUMBER sentence with date NUMBER and intervals NUMBER in years NUMBER'
What it does:
[0-9- ]+ matches a run of digits, minus signs and spaces
((?<=^)|(?<= )) requires the beginning of the string or a space before the match
(?=$| ) requires a space or the end of the string after the match
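Neither pattern above reproduces the requested output exactly: the first splits a range such as 06-08 into two replacements, and the second merges neighbouring numbers into one. A minimal sketch that treats a number, optionally followed by a hyphen and a second number, as a single unit (the names NUMBER_RE and replace_numbers are only illustrative):

```python
import re

# A run of digits, optionally followed by "-" (spaces allowed around it) and more digits
NUMBER_RE = re.compile(r"\d+(?:\s*-\s*\d+)?")

def replace_numbers(text, word="NUMBER"):
    """Replace every number or hyphenated range in `text` with `word`."""
    return NUMBER_RE.sub(word, text)

s = "This is a 10 sentence with date 010616 and intervals 06-08 200-209 01 - 09 in years 2012 26"
print(replace_numbers(s))
# This is a NUMBER sentence with date NUMBER and intervals NUMBER NUMBER NUMBER in years NUMBER NUMBER
```

Digits embedded in words (for example abc123) would be replaced as well; adding \b around the pattern is one way to restrict it to standalone numbers.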

Related

Finding pattern to extract text from a string using regular expression

I have data which I'm reading in a string format as
>>> 26 24 16 Panelboards 10/05/18 26 26 00i Power Distribution Units – Install 10/05/18
I want to separate '26 24 16', 'Panelboards', '10/05/18' and '26 26 00i', 'Power Distribution Units – Install', '10/05/18' into sub section, name, and date.
A new item can begin after every date. In this case, a new sub section begins after 10/05/18.
I used a regular expression to split on the sub section, but it leaves my data unstructured:
re.split(r'\d\d \d\d \d\d',sentence)
Is there a way to efficiently retrieve these 3 features for both items?
Also, I can't rely on two spaces as the separator, because the structure of the file may change.
Try:
import re
# Fields are separated by two or more whitespace characters
s = """26 24 16  Panelboards  10/05/18 26 26 00i  Power Distribution Units – Install  10/05/18"""
out = re.split(r"\s{2,}", s)
print(out)
Prints:
['26 24 16', 'Panelboards', '10/05/18 26 26 00i', 'Power Distribution Units – Install', '10/05/18']
EDIT: If you want to split the 2nd item, use str.split() with maxsplit=1:
import re
from itertools import chain
s = """26 24 16  Panelboards  10/05/18 26 26 00i  Power Distribution Units – Install  10/05/18"""
out = re.split(r"\s{2,}", s)
# Split the third field once to separate the date from the next sub section
out = list(chain(out[:2], out[2].split(maxsplit=1), out[3:]))
print(out)
Prints:
['26 24 16', 'Panelboards', '10/05/18', '26 26 00i', 'Power Distribution Units – Install', '10/05/18']
You can use
\b(?P<subsection>\d+(?:\s+\d\w*)+)\s+(?P<name>.*?)\s+(?P<date>\d{1,2}/\d{1,2}/\d{2})\b
See the regex demo. Details:
\b - word boundary
(?P<subsection>\d+(?:\s+\d\w*)+) - Group "subsection": one or more digits, then one or more occurrences of one or more whitespaces followed by a digit and zero or more word chars
\s+ - one or more whitespaces
(?P<name>.*?) - Group "name": zero or more chars other than line break chars, as few as possible
\s+ - one or more whitespaces
(?P<date>\d{1,2}/\d{1,2}/\d{2}) - Group "date": one or two digits, /, one or two digits, /, two digits
\b - word boundary
See a Python demo:
import re
pattern = r"\b(?P<subsection>\d+(?:\s+\d\w*)+)\s+(?P<name>.*?)\s+(?P<date>\d{1,2}/\d{1,2}/\d{2})\b"
text = "26 24 16 Panelboards 10/05/18 26 26 00i Power Distribution Units – Install 10/05/18"
print([x.groupdict() for x in re.finditer(pattern, text)])
Output:
[
{'subsection': '26 24 16', 'name': 'Panelboards', 'date': '10/05/18'},
{'subsection': '26 26 00i', 'name': 'Power Distribution Units – Install', 'date': '10/05/18'}
]
Be as specific as you can:
/^(\d\d \d\d \d\d) +(.+?) +(\d\d\/\d\d\/\d\d)$/
Match group 1 for the subsection, 2 for the name and 3 for the date.
If you first need to split the string into one item per line, you could split at the end of each date:
\/\d\d\s
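In Python the slashes need no escaping and the pattern is written as a raw string. A minimal sketch, assuming each item already sits on its own line (ITEM_RE and parse_item are hypothetical names). Note that the strict \d\d \d\d \d\d prefix rejects a sub section such as 26 26 00i, so the pattern would have to be loosened for that data:

```python
import re

# Two-digit triple, then the name (as short as possible), then a dd/dd/dd date
ITEM_RE = re.compile(r"^(\d\d \d\d \d\d) +(.+?) +(\d\d/\d\d/\d\d)$")

def parse_item(line):
    """Return (subsection, name, date), or None if the line does not match."""
    m = ITEM_RE.match(line)
    return m.groups() if m else None

print(parse_item("26 24 16 Panelboards 10/05/18"))
# ('26 24 16', 'Panelboards', '10/05/18')
print(parse_item("26 26 00i Power Distribution Units – Install 10/05/18"))
# None, because "00i" is not exactly two digits followed by a space
```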

How to find the words that correspond to months and replace them with numbers?

How can I find the words that correspond to the months "January, February, March, etc." and replace them with the numerals "01, 02, 03, ..."?
I tried the code below
def transformMonths(string):
    rep = [("May", "05"), ("June", "06")]
    for pat, repl in rep:
        s = re.sub(pat, repl, string)
    return s
print( transformMonths('I was born on June 24 and my sister was born on May 17') )
My code provides this result ('I was born on 06 24 and my sister was born on May 17')
However, I want the output to be like this ('I was born on 06 24 and my sister was born on 05 17')
You are performing the replacement on the initial (unmodified) string at each iteration so you end up with only one month name being replaced. You can fix that by assigning string instead of s in the loop (and return string at the end).
Note that your approach does not require a regular expression and could use a simple string replace: string = string.replace(pat,repl).
In both cases, because the replacement does not take into account word boundaries, the function would replace partial words such as:
"Mayor Smith was elected on May 25" --> "05or Smith was elected on 05 25".
You can fix that in your regular expression by adding \b before and after each month name. This will ensure that the month names are only found if they are between word boundaries.
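A minimal sketch of both fixes applied to the original loop: the result is fed back into the next iteration, and each month name is wrapped in \b. The snake_case function name and the use of re.escape are choices made here, not part of the question:

```python
import re

def transform_months(text):
    replacements = [("May", "05"), ("June", "06")]
    for name, number in replacements:
        # \b keeps "May" from matching inside "Mayor"
        text = re.sub(r"\b" + re.escape(name) + r"\b", number, text)
    return text

print(transform_months("I was born on June 24 and my sister was born on May 17"))
# I was born on 06 24 and my sister was born on 05 17
print(transform_months("Mayor Smith was elected on May 25"))
# Mayor Smith was elected on 05 25
```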
re.sub can perform multiple replacements with varying values if you give it a function instead of a fixed string. So you can build one combined regular expression that finds all the months, and replace each word found by looking it up in a dictionary:
import re
def numericMonths(string):
    months = {"January": "01", "February": "02", "March": "03", "April": "04",
              "May": "05", "June": "06", "July": "07", "August": "08",
              "September": "09", "October": "10", "November": "11", "December": "12"}
    pattern = r"\b(" + "|".join(months) + r")\b"  # all months as distinct words
    return re.sub(pattern, lambda m: months[m.group()], string)
output:
numericMonths('I was born on June 24 and my sister was born on May 17')
'I was born on 06 24 and my sister was born on 05 17'

Extracting numerical values from a string with at most 6 digits with optional 2 digits for decimal

I have a task in which I need to extract numerical values from a text. However, I am only interested in values that have at most 6 digits, with an optional decimal part.
For example, from the below text:
Total compensation for Mr. XYZ was $5,123,456 and other salary which was $650,000 in fiscal 2018, was determined to be approximately 8.78 times the median annual compensation for all of the firm's other employees, which was approximately $74,000. Some other salaries are 56000.
I need to extract
["650,000", "2018", "8.78", "74,000", "56000"]
from this.
The regex I am using:
((\d{1,3})(?:,[0-9]{3}){0,1}|(\d{1,6}))(\.\d{1,2})?
It is correctly identifying 650,000 and 74,000 but doesn't identify others correctly.
I found this 7-digit money regex and tried to adapt it for 6 digits, but wasn't successful. How do I correct my regex?
Try this : (?<![\d,.])(?:\d,?){0,5}\d(?:\.\d+)?(?!,?\d)
Here's a detailed explanation:
(?x) # verbose mode: whitespace and comments in the pattern are ignored
# Do not start in the middle of a number: no digit, comma or dot before the match
(?<![\d,.])
# Up to k-1 leading digits, each optionally followed by a comma. For simplicity, oddly grouped input such as 5,4,3,2 is accepted too
(?:\d,?){0,5}
# The k-th (last) digit of the integer part
\d
# Optional dot and decimal part
(?:\.\d+)?
# Do not stop in the middle of a longer number: no digit may follow. A trailing comma is fine as punctuation, but comma + digit is rejected
(?!,?\d)
This could probably be improved, but I think it does what you wanted. Some cases may not be handled; tell me if you find any.
Test it here : https://regex101.com/r/Wxi5Sj/2
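For reference, a sketch of how this pattern could be run in Python with re.VERBOSE, applied to the text from the question (NUMBER_RE is an illustrative name). Because the pattern has no capturing groups, re.findall returns the full matches:

```python
import re

NUMBER_RE = re.compile(r"""
    (?<![\d,.])      # not preceded by a digit, comma or dot
    (?:\d,?){0,5}    # up to five digits, each optionally followed by a comma
    \d               # the last digit of the integer part
    (?:\.\d+)?       # optional decimal part
    (?!,?\d)         # not followed by a digit, nor by a comma and a digit
""", re.VERBOSE)

text = (
    "Total compensation for Mr. XYZ was $5,123,456 and other salary which was "
    "$650,000 in fiscal 2018, was determined to be approximately 8.78 times the "
    "median annual compensation for all of the firm's other employees, which was "
    "approximately $74,000. Some other salaries are 56000."
)

print(NUMBER_RE.findall(text))
# ['650,000', '2018', '8.78', '74,000', '56000']
```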
Try the code below:
import re
text = "Total compensation for Mr. XYZ was $5,123,456 and other salary which was $650,000 in fiscal 2018, was determined to be approximately 8.78 times the median annual compensation for all of the firm's other employees, which was approximately $74,000. Some other salaries are 56000. "
print(re.findall(r'(?<=\s)\$?\d{0,3}\,?\d{1,3}(?:\.\d{2})?(?!,?\d)', text))
Output
['$650,000', '2018', '8.78', '$74,000', '56000']

Regex to detect a phone number

I need to find a phone number in a given paragraph of text, under the following condition.
The word Phone/Ph/tel/telephone should exist in the sentence where the phone number is present.
For example, consider the paragraph below:
This is my Phone number and I am 25 years old, 999-888-7894 and I am looking for a regex script.
As you can see, this paragraph contains a phone number, and the word "Phone" appears in the same sentence (31 characters before the phone number).
So I would like to detect this as a phone number if and only if one of the words Phone/Ph/tel/telephone occurs within 50 characters before or after it.
I tried using lookarounds in the regex, but it did not work.
import re
phno = re.compile(r'(?<=Ph\s)(?<=Phone\s)(?<=tel\s)telephone(?<=telephone\s)\b([0-9]{3}[-][0-9]{3}[-][0-9]{4})\b',re.MULTILINE)
data = "This is my phone number and I am 25 years old, 999-888-7894 and I am looking for a regex script."
l = phno.findall(data)
print(l)
I get an empty list [] as output because the word 'Phone' is not detected by the regex (I need it to be detected within 50 characters before or after the phone number).
import re
data = """This is my phone number and I am 25 years old, 999-888-7894 and I am looking for a regex script.
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 999-123-4567 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
And 555-555-1212 is my telephone."""
phno = re.compile(r'\b(?:phone|ph|telephone)\b.{0,49}\b(\d{3}[-]\d{3}[-]\d{4})\b|\b(\d{3}[-]\d{3}[-]\d{4})\b.{0,49}\b(?:phone|ph|telephone)\b', flags=re.I)
phones = [m.group(1) if m.group(1) else m.group(2) for m in phno.finditer(data)]
print(phones)
Prints:
['999-888-7894', '555-555-1212']
Assuming you only want to detect hyphen-separated US phone numbers containing area codes, you could use the following regex pattern with re.findall:
\b\d{3}-\d{3}-\d{4}\b
Script:
import re
sentence = "This is my Phone number and I am 25 years old, 999-888-7894 and I am looking for a regex script."
numbers = re.findall(r'\b\d{3}-\d{3}-\d{4}\b', sentence)
print(numbers)
This prints:
['999-888-7894']
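This pattern on its own does not check for a nearby keyword. One possible sketch of the 50-character condition from the question: find every number first, then accept it only if a keyword lies entirely inside the window around it. The function name and the window parameter are hypothetical, and the keyword list is taken from the question:

```python
import re

PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
KEYWORD_RE = re.compile(r"\b(?:phone|ph|tel|telephone)\b", re.IGNORECASE)

def find_phones_near_keyword(text, window=50):
    """Return phone numbers that have a keyword within `window` characters."""
    keyword_spans = [k.span() for k in KEYWORD_RE.finditer(text)]
    found = []
    for m in PHONE_RE.finditer(text):
        lo, hi = m.start() - window, m.end() + window
        # Keep the number if any keyword lies completely inside [lo, hi]
        if any(start >= lo and end <= hi for start, end in keyword_spans):
            found.append(m.group())
    return found

data = "This is my phone number and I am 25 years old, 999-888-7894 and I am looking for a regex script."
print(find_phones_near_keyword(data))
# ['999-888-7894']
```

Searching for keywords in the full text, rather than in a slice around each number, avoids a false word boundary where a slice would cut a word in half.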

Python Regex - using re.sub to clean up a string

I am having some problems using regex sub to remove numbers from strings. Input strings can look like:
"The Term' means 125 years commencing on and including 01 October 2015."
"125 years commencing on 25th December 1996"
"the term of 999 years from the 1st January 2011"
What I want to do is remove the number and the word 'years' - I am also parsing the string for dates using DateFinder, but DateFinder interprets the number as a date - hence why I want to remove the number.
Any thoughts on a regex to remove the number and the word 'years'?
I think this does what you want:
import re
my_list = ["The Term' means 125 years commencing on and including 01 October 2015.",
"125 years commencing on 25th December 1996",
"the term of 999 years from the 1st January 2011",
]
for item in my_list:
    new_item = re.sub(r"\d+\syears", "", item)
    print(new_item)
results:
The Term' means commencing on and including 01 October 2015.
commencing on 25th December 1996
the term of from the 1st January 2011
Note that you will end up with some extra whitespace (maybe you want that?). You could clean it up by collapsing runs of whitespace:
new_item = re.sub(r"\s+", " ", new_item)
and then trimming both ends, either with a regex, new_item = re.sub(r"^\s+|\s+$", "", new_item), or more simply with:
new_item = new_item.strip()
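The substitution and the whitespace cleanup can also be folded into a single pattern. A minimal sketch (YEARS_RE and drop_year_counts are illustrative names) that removes a number only when it is followed by the word years, so the digits in the dates are left for the date parser:

```python
import re

# A whole number, whitespace, the word "years", and any whitespace after it
YEARS_RE = re.compile(r"\b\d+\s+years\b\s*")

def drop_year_counts(text):
    """Remove '<number> years' phrases and trim the ends of the string."""
    return YEARS_RE.sub("", text).strip()

for item in [
    "The Term' means 125 years commencing on and including 01 October 2015.",
    "125 years commencing on 25th December 1996",
    "the term of 999 years from the 1st January 2011",
]:
    print(drop_year_counts(item))
# The Term' means commencing on and including 01 October 2015.
# commencing on 25th December 1996
# the term of from the 1st January 2011
```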
Try this to remove all numbers and the word 'years':
re.sub(r'\s+\d+|\s+years', '', text)
For instance, if:
text="The Term' means 125 years commencing on and including 01 October 2015."
then the output will be:
"The Term' means commencing on and including October."
