Python Regex - using re.sub to clean up a string

I am having some problems using regex sub to remove numbers from strings. Input strings can look like:
"The Term' means 125 years commencing on and including 01 October 2015."
"125 years commencing on 25th December 1996"
"the term of 999 years from the 1st January 2011"
What I want to do is remove the number and the word 'years' - I am also parsing the string for dates using DateFinder, but DateFinder interprets the number as a date - hence why I want to remove the number.
Any thoughts on a regular expression to remove the number and the word 'years'?

I think this does what you want:
import re
my_list = ["The Term' means 125 years commencing on and including 01 October 2015.",
           "125 years commencing on 25th December 1996",
           "the term of 999 years from the 1st January 2011",
           ]
for item in my_list:
    # raw string so that \d and \s reach the regex engine unchanged
    new_item = re.sub(r"\d+\syears", "", item)
    print(new_item)
Results:
The Term' means  commencing on and including 01 October 2015.
 commencing on 25th December 1996
the term of  from the 1st January 2011
Note that you will end up with some extra whitespace (maybe you want that?). You could clean that up by collapsing runs of whitespace into a single space:
new_item = re.sub(r"\s+", " ", new_item)
and then trimming the ends, either with a regex (because I love regexes):
new_item = re.sub(r"^\s+|\s+$", "", new_item)
or, more simply:
new_item = new_item.strip()
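The removal and the whitespace cleanup can also be folded into one helper. This is only a minimal sketch: the name remove_year_counts is hypothetical, and it assumes the word to drop is always "year" or "years" directly after the number.

```python
import re

# Hypothetical helper, not part of the original answer.
# \b keeps the match on whole tokens; the trailing \s* swallows the
# whitespace that would otherwise be left behind after the removal.
YEAR_COUNT = re.compile(r"\b\d+\s+years?\b\s*")

def remove_year_counts(text):
    """Drop '<number> years' and trim whitespace at both ends."""
    return YEAR_COUNT.sub("", text).strip()

samples = [
    "The Term' means 125 years commencing on and including 01 October 2015.",
    "125 years commencing on 25th December 1996",
    "the term of 999 years from the 1st January 2011",
]
for sample in samples:
    print(remove_year_counts(sample))
```

Because the digits are only removed when "years" follows them, the day and year of the actual date are left in place for a date parser to pick up.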

Try this to remove the numbers and the word 'years':
re.sub(r'\s+\d+|\s+years', '', text)
If, for instance:
text="The Term' means 125 years commencing on and including 01 October 2015."
then the output will be as follows (note that this pattern also removes the day and the year of the date):
"The Term' means commencing on and including October."

Related

How to find the words correspond to month and replace it with numerical?

How to find the words that correspond to the month "January, February, March,.. etc." and replace them with numerical "01, 02, 03,.."
I tried the code below
def transformMonths(string):
    rep = [("May", "05"), ("June", "06")]
    for pat, repl in rep:
        s = re.sub(pat, repl, string)
    return s
print( transformMonths('I was born on June 24 and my sister was born on May 17') )
My code provides this result ('I was born on 06 24 and my sister was born on May 17')
However, I want the output to be like this ('I was born on 06 24 and my sister was born on 05 17')
You are performing the replacement on the initial (unmodified) string at each iteration so you end up with only one month name being replaced. You can fix that by assigning string instead of s in the loop (and return string at the end).
Note that your approach does not require a regular expression and could use a simple string replace: string = string.replace(pat,repl).
In both cases, because the replacement does not take into account word boundaries, the function would replace partial words such as:
"Mayor Smith was elected on May 25" --> "05or Smith was elected on 05 25".
You can fix that in your regular expression by adding \b before and after each month name. This will ensure that the month names are only found if they are between word boundaries.
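Both fixes described so far (carrying the modified string through the loop, and adding \b around each month name) might look like this. It is a sketch only; transform_months and its variable names are illustrative rather than taken from the question.

```python
import re

def transform_months(text):
    """Replace whole-word month names with their two-digit numbers."""
    replacements = [("May", "05"), ("June", "06")]
    for name, number in replacements:
        # Reassign `text` so each pass builds on the previous one, and
        # wrap the name in \b so "Mayor" is not touched.
        text = re.sub(r"\b" + re.escape(name) + r"\b", number, text)
    return text

print(transform_months("I was born on June 24 and my sister was born on May 17"))
print(transform_months("Mayor Smith was elected on May 25"))
```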
The re.sub can perform multiple replacements with varying values if you give it a function instead of a fixed string. So you can build a combined regular expression that will find all the months and replace the words that are found using a dictionary:
import re
def numericMonths(string):
    months = {"January":"01", "February":"02","March":"03", "April":"04",
              "May":"05", "June":"06", "July":"07", "August":"08",
              "September":"09","October":"10", "November":"11","December":"12"}
    pattern = r"\b("+"|".join(months)+r")\b" # all months as distinct words
    return re.sub(pattern,lambda m:months[m.group()],string)
output:
numericMonths('I was born on June 24 and my sister was born on May 17')
'I was born on 06 24 and my sister was born on 05 17'

Delete all the digits from a string except the digits that are followed by given letter using re.sub() in python3 [closed]

I know how to remove all digits from a string using re.sub(). But I don't know how to remove all digits from a string except some special ones.
For example, let's say I have the string below:
"On 11 December 2008, India entered the 3G arena"
And I want the output as:
"On December, India entered the 3G arena"
You may use a negative lookahead (?!...) to ensure that the character following the digit is not one of the letters you choose.
Here is an example in which digits followed by any of the characters G, J or K are left untouched:
import re
print(re.sub(r"\d(?![GJK])", "", "On 11 December 2008, India entered the 3G arena 1A 3J 5K"))
# On  December , India entered the 3G arena A 3J 5K
You might use \b (word boundary) to delete numbers that are not part of words, in the following way:
import re
txt = "On 11 December 2008, India entered the 3G arena"
cleaned = re.sub(r'\b \d+\b','',txt)
print(cleaned)
Output:
On December, India entered the 3G arena
Note that there is a space before \d+, as otherwise you would end up with doubled spaces. This solution assumes that the digits to remove always follow a space; if that does not hold, you might use r'\b\d+\b' and then remove the superfluous spaces.
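The fallback mentioned above (r'\b\d+\b' followed by a whitespace cleanup) could be sketched as follows. The function name and the punctuation handling are assumptions added for illustration, not part of the answer.

```python
import re

def strip_standalone_numbers(text):
    """Remove numbers that stand alone as words, then tidy the spacing."""
    # "3G" survives: there is no word boundary between "3" and "G".
    without_numbers = re.sub(r"\b\d+\b", "", text)
    # Collapse the doubled spaces left where a number used to be.
    collapsed = re.sub(r"\s{2,}", " ", without_numbers)
    # Assumption: a space left in front of punctuation should go too.
    return re.sub(r"\s+([,.;:])", r"\1", collapsed).strip()

print(strip_standalone_numbers("On 11 December 2008, India entered the 3G arena"))
```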
While azro's answer covers the general case, here's a solution to remove numbers around month names:
import calendar
import re
month_names = '|'.join([calendar.month_name[i] for i in range(1,13)])
s = "On 11 December 2008, India entered the 3G arena"
re.sub(fr"\d+\s+({month_names})\s+\d+", r"\1", s)
#'On December, India entered the 3G arena'

How to write a regex in python to recognize days inside a string

In this assignment, the input wanted is in this format:
Regular: 16Mar2009(mon), 17Mar2009(tues), 18Mar2009(wed) ...
Reward: 26Mar2009(thur), 27Mar2009(fri), 28Mar2009(sat)
Regular or Reward is the name of customer type. I separated this string like this.
entry_list = input.split(":") #input is a variable
client = entry_list[0] # only Regular or Reward
dates = entry_list[1] # only dates
days = dates.split(",")
But now I need to count weekdays or weekend days inside the days list:
days = [' 16Mar2009(mon)', ' 17Mar2009(tues)', ' 18Mar2009(wed)']
When it is mon tues wed thur fri, all count as weekday, and I need to know how many weekdays the input have.
When it is sat sun must be counted as weekend days, and I need to know how many weekends the input have.
How to write a regex in python to search for all weekdays and weekend days inside this list and count them, putting the number of weekdays and weekend days in two different counters?
EDIT
I wrote this function to check if the dates are in the right format, but it's not working:
def is_date_valid(date):
    date_regex = re.compile(r'(?:\d{1,2}[A-Za-z]{3}\d{4}\([A-Za-z]{3}\),\s+){2}\d{1,2}[A-Za-z]{3}\d{4}\([A-Za-z]{3}\)$')
    m = date_regex.search(date)
m is always None
You don't really need a regex for this. You can just look for "sat" and "sun" tags directly, since your days are formatted the same way (i.e. no capitals, no "tue" instead of "tues", etc.) you shouldn't need to generalize to a pattern. Just loop through the list and look for "sat" and "sun":
import re  # only needed if you use the re.search variant
days = [' 16Mar2009(mon)', ' 17Mar2009(tues)', ' 18Mar2009(wed)', ' 18Mar2009(sat)', ' 18Mar2009(sun)']
weekends = 0
weekdays = 0
for day in days:
    if "sat" in day or "sun" in day:  # if re.search('(sat|sun)', day): also works
        weekends = weekends + 1
    else:
        weekdays = weekdays + 1
print(weekends)
print(weekdays)
Output:
2
3
If you need to use a regex (because this is part of an assignment, for example), this variation of the if statement will do it: if re.search('(sat|sun)', day): This isn't much more useful than the plain string test, since the strings are the regex in this case, but seeing how to combine multiple patterns into one regex with or-style logic is useful, so I'm still including it here.
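On the validation function from the question's edit: one likely reason it returns None is that [A-Za-z]{3} inside the parentheses cannot match the four-letter tags "tues" and "thur". The sketch below allows three or four letters and counts both kinds of day. The names ENTRY, WEEKEND and count_days are made up for this example, and it assumes one "client: dates" line per call.

```python
import re

# One date entry such as "17Mar2009(tues)"; the day tag may be 3 or 4 letters.
ENTRY = re.compile(r"\d{1,2}[A-Za-z]{3}\d{4}\([A-Za-z]{3,4}\)")
# A weekend tag, matched together with its parentheses.
WEEKEND = re.compile(r"\((sat|sun)\)")

def count_days(line):
    """Return (client, weekday_count, weekend_count) for one input line."""
    client, dates = line.split(":", 1)
    days = [d.strip() for d in dates.split(",")]
    for day in days:
        if not ENTRY.fullmatch(day):
            raise ValueError("badly formatted date: " + repr(day))
    weekends = sum(1 for day in days if WEEKEND.search(day))
    return client.strip(), len(days) - weekends, weekends

print(count_days("Regular: 16Mar2009(mon), 17Mar2009(tues), 18Mar2009(wed)"))
print(count_days("Reward: 26Mar2009(thur), 27Mar2009(fri), 28Mar2009(sat)"))
```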

Replacing numbers in various formats with a word

I have a long sentence and I want to replace all numbers with a particular word. The numbers come in different formats, e.g.,
36
010616
010516 - 300417
01-04
2011 12
Is there a function in Python for replacing these types of occurrences with a word (say, "integer"), or what would the regex for these look like?
Example:
"This is a 10 sentence with date 010616 and intervals 06-08 200-209 01 - 09 in years 2012 26"
should become
"This is a NUMBER sentence with date NUMBER and intervals NUMBER NUMBER NUMBER in years NUMBER NUMBER"
Using Regex.
import re
s = "This is a 10 sentence with date 010616 and intervals 06-08 200-209 01 - 09 in years 2012 26"
print(re.sub(r"\d+", "NUMBER", s))
Output:
This is a NUMBER sentence with date NUMBER and intervals NUMBER-NUMBER NUMBER-NUMBER NUMBER - NUMBER in years NUMBER NUMBER
re.sub('((?<=^)|(?<= ))[0-9- ]+(?=$| )', 'NUMBER', s)
'This is a NUMBER sentence with date NUMBER and intervals NUMBER in years NUMBER'
What it does:
It looks for runs of digits, minus signs and spaces: [0-9- ]+
preceded by a space or the beginning of the string: ((?<=^)|(?<= ))
and followed by a space or the end of the string: (?=$| )
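Neither pattern above reproduces the output asked for in the question: the first splits every range into two numbers, and the second merges neighbouring numbers into one. A possible middle ground, offered only as a sketch, is to treat a number optionally followed by a hyphen and a second number as a single unit. The pattern and the name replace_numbers are assumptions, not taken from either answer.

```python
import re

# A number, optionally followed by "-" (with or without surrounding
# spaces) and a second number, counts as one match.
NUMBER_OR_RANGE = re.compile(r"\d+(?:\s*-\s*\d+)?")

def replace_numbers(text, word="NUMBER"):
    """Replace every number or hyphenated range with `word`."""
    return NUMBER_OR_RANGE.sub(word, text)

s = "This is a 10 sentence with date 010616 and intervals 06-08 200-209 01 - 09 in years 2012 26"
print(replace_numbers(s))
```

Numbers separated only by a space, such as "2012 26", stay separate matches because the optional group requires a hyphen.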

Tokenizing with different delimiters

Say I'm reading a file whose lines share a structure but differ in content. For example, 'directory.csv' reads as follows:
November 11, Veterans’s Day
November 24, Thanksgiving
December 25, Christmas
I want to split the lines by space, then by comma, so I get the month, the day, and the holiday. I want to use re.split, but I don't know how to set up the regular expression. This is what I have:
fp = open('holidays2011.csv', 'r')
import re
for item in fp:
    month, day, holiday = re.split('; |, ', item)
    print(month, day, holiday)
But when I run it, it says there are not enough items to unpack. Why? I'm splitting at the space and the comma, which should give me 3 items that I assign to 3 variables.
You don't need regular expressions for this:
with open("Input.txt") as inFile:
    for item in inFile:
        datePart, holiday = item.split(", ", 1)
        month, day = datePart.split()
Splitting first on space is a bad idea because of the space character in the holiday name. You can use regex grouping to obtain the parts without using re.split (note the parentheses around the parts):
>>> import re
>>> s = """November 11, Veterans’s Day
... November 24, Thanksgiving
... December 25, Christmas"""
>>> for line in s.split('\n'):
...     month, day, holiday = re.match(r'(\w+) (\d+), (.+)', line).groups()
...     print(month)
...     print(day)
...     print(holiday)
...     print('')
...
November
11
Veterans’s Day

November
24
Thanksgiving

December
25
Christmas
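As for why the original re.split call fails: the pattern '; |, ' only splits on "; " or ", ", never on a bare space, so each line yields two items instead of three. If re.split is still preferred, one option is to split on either separator and cap the number of splits so the holiday name stays whole. This is a sketch; split_holiday is a hypothetical name.

```python
import re

line = "November 11, Veterans’s Day\n"

# The pattern from the question: only two pieces come back.
print(re.split('; |, ', line.strip()))

def split_holiday(line):
    """Split 'Month DD, Holiday name' into its three parts."""
    # Split on a comma (plus any following spaces) or on whitespace,
    # at most twice, so spaces inside the holiday name are kept.
    month, day, holiday = re.split(r",\s*|\s+", line.strip(), maxsplit=2)
    return month, day, holiday

print(split_holiday(line))
```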
