Tokenizing with different delimiters - python

Say I'm reading a file in which every line has the same structure but different content. For example, 'directory.csv' reads as follows:
November 11, Veterans’s Day
November 24, Thanksgiving
December 25, Christmas
I want to split each line by space, then by comma, so I can get the month, the day, and the holiday. I want to use re.split, but I don't know how to set up the regular expression. This is what I have:
fp = open('holidays2011.csv', 'r')
import re
for item in fp:
    month, day, holiday = re.split('; |, ', item)
    print month, day, holiday
But when I run it, I get an error saying there are not enough items to unpack. Why? I'm splitting at the space and the comma, which should give me 3 items for my 3 variables.

You don't need regular expressions for this:
with open("Input.txt") as inFile:
    for item in inFile:
        # Split once on ", " and drop the trailing newline from the holiday name
        datePart, holiday = item.rstrip("\n").split(", ", 1)
        month, day = datePart.split()
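As a rough Python 3 sketch of why the unpacking fails and how the two-step split avoids it: the question's pattern '; |, ' means "semicolon-space or comma-space", so a lone space never acts as a delimiter and each line yields only two items. The helper name parse_holiday is made up for illustration.

```python
import re

line = "November 24, Thanksgiving\n"

# '; |, ' only splits on "; " or ", ", so the month and day stay together
print(re.split('; |, ', line))  # ['November 24', 'Thanksgiving\n']


def parse_holiday(line):
    """Hypothetical helper: split 'Month Day, Holiday' into its three parts."""
    # Split once on ", " so any later commas stay inside the holiday name
    date_part, holiday = line.strip().split(", ", 1)
    # The date part contains exactly one space between month and day
    month, day = date_part.split()
    return month, int(day), holiday


print(parse_holiday(line))  # ('November', 24, 'Thanksgiving')
```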

Splitting first on space is a bad idea because the holiday name can itself contain spaces. You can use regex grouping to obtain the parts without using re.split (note the parentheses around the parts):
>>> import re
>>> s = """November 11, Veterans’s Day
... November 24, Thanksgiving
... December 25, Christmas"""
>>> for line in s.split('\n'):
...     month, day, holiday = re.match(r'(\w+) (\d+), (.+)', line).groups()
...     print month
...     print day
...     print holiday
...     print ''
...
November
11
Veterans’s Day
November
24
Thanksgiving
December
25
Christmas
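A Python 3 variant of the same grouping idea, sketched with named groups and a check for lines that do not match (re.match returns None in that case, and calling .groups() on it would raise). The names LINE_RE and parse_line are illustrative only.

```python
import re

# Same structure as the pattern above, with named groups for readability
LINE_RE = re.compile(r'(?P<month>\w+) (?P<day>\d+), (?P<holiday>.+)')


def parse_line(line):
    """Hypothetical helper: return (month, day, holiday) or None if the line does not fit."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return m.group('month'), int(m.group('day')), m.group('holiday')


for raw in ["November 24, Thanksgiving", "December 25, Christmas", ""]:
    print(parse_line(raw))
```

Returning None for blank or malformed lines is one possible design; raising an error would be equally reasonable if every line is expected to be valid.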

Related

How to find the words that correspond to months and replace them with numbers?

How can I find the words that correspond to months ("January", "February", "March", etc.) and replace them with their numbers ("01", "02", "03", ...)?
I tried the code below:
def transformMonths(string):
    rep = [("May", "05"), ("June", "06")]
    for pat, repl in rep:
        s = re.sub(pat, repl, string)
    return s
print( transformMonths('I was born on June 24 and my sister was born on May 17') )
My code provides this result ('I was born on 06 24 and my sister was born on May 17')
However, I want the output to be like this ('I was born on 06 24 and my sister was born on 05 17')
You are performing the replacement on the initial (unmodified) string at each iteration, so only the last replacement (June) survives in s. You can fix that by assigning the result back to string instead of s in the loop (and returning string at the end).
Note that your approach does not require a regular expression and could use a simple string replace: string = string.replace(pat,repl).
In both cases, because the replacement does not take into account word boundaries, the function would replace partial words such as:
"Mayor Smith was elected on May 25" --> "05or Smith was elected on 05 25".
You can fix that in your regular expression by adding \b before and after each month name. This will ensure that the month names are only found if they are between word boundaries.
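A minimal sketch of the loop with both fixes applied (reassigning the same variable and adding \b around each name); the function name transform_months is only for illustration.

```python
import re


def transform_months(text):
    replacements = [("May", "05"), ("June", "06")]
    for name, number in replacements:
        # Reassign text so each replacement builds on the previous one;
        # \b keeps "May" from matching inside "Mayor"
        text = re.sub(r"\b" + name + r"\b", number, text)
    return text


print(transform_months('I was born on June 24 and my sister was born on May 17'))
# I was born on 06 24 and my sister was born on 05 17
print(transform_months('Mayor Smith was elected on May 25'))
# Mayor Smith was elected on 05 25
```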
re.sub can perform multiple replacements with varying values if you give it a function instead of a fixed replacement string. So you can build a combined regular expression that finds all the months and replaces each match using a dictionary:
import re
def numericMonths(string):
    months = {"January":"01", "February":"02", "March":"03", "April":"04",
              "May":"05", "June":"06", "July":"07", "August":"08",
              "September":"09", "October":"10", "November":"11", "December":"12"}
    pattern = r"\b("+"|".join(months)+r")\b" # all months as distinct words
    return re.sub(pattern, lambda m: months[m.group()], string)
output:
numericMonths('I was born on June 24 and my sister was born on May 17')
'I was born on 06 24 and my sister was born on 05 17'

How to write a regex in python to recognize days inside a string

In this assignment, the input wanted is in this format:
Regular: 16Mar2009(mon), 17Mar2009(tues), 18Mar2009(wed) ...
Reward: 26Mar2009(thur), 27Mar2009(fri), 28Mar2009(sat)
Regular or Reward is the name of customer type. I separated this string like this.
entry_list = input.split(":") #input is a variable
client = entry_list[0] # only Regular or Reward
dates = entry_list[1] # only dates
days = dates.split(",")
But now I need to count weekdays or weekend days inside the days list:
days = [' 16Mar2009(mon)', ' 17Mar2009(tues)', ' 18Mar2009(wed)']
mon, tues, wed, thur and fri all count as weekdays, and I need to know how many weekdays the input has.
sat and sun must be counted as weekend days, and I need to know how many weekend days the input has.
How can I write a regex in Python that finds all weekdays and weekend days in this list and counts them in two separate counters?
EDIT
I wrote this function to check whether the dates are in the right format, but it's not working:
def is_date_valid(date):
    date_regex = re.compile(r'(?:\d{1,2}[A-Za-z]{3}\d{4}\([A-Za-z]{3}\),\s+){2}\d{1,2}[A-Za-z]{3}\d{4}\([A-Za-z]{3}\)$')
    m = date_regex.search(date)
m is only returning None
You don't really need a regex for this. Since your days are all formatted the same way (no capitals, no "tue" instead of "tues", etc.), you don't need to generalize to a pattern. Just loop through the list and look for the "sat" and "sun" tags directly:
import re # only needed if you use the re.search variant
days = [' 16Mar2009(mon)', ' 17Mar2009(tues)', ' 18Mar2009(wed)', ' 18Mar2009(sat)', ' 18Mar2009(sun)']
weekends = 0
weekdays = 0
for day in days:
    if "sat" in day or "sun" in day: # if re.search('(sat|sun)', day): also works
        weekends = weekends+1
    else:
        weekdays = weekdays+1
print(weekends)
print(weekdays)
>>>2
>>>3
If you need to use a regex (because this is part of an assignment, for example), then this variation of the if statement will do it: if re.search('(sat|sun)', day): This isn't much more useful than the plain string checks, since the strings are the regex in this case, but seeing how to combine multiple patterns into one regex with or-style logic is useful, so I'm still including it here.
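If validation and counting should happen in one pass, something like the following sketch could work. One likely reason the validation regex in the question returns None is that \([A-Za-z]{3}\) only accepts three-letter tags, while "tues" and "thur" have four letters; that regex also requires exactly three dates in the string. The names ENTRY_RE and count_day_types, and the decision to raise ValueError on a bad entry, are assumptions made for illustration.

```python
import re
from collections import Counter

# One entry such as "16Mar2009(mon)"; the group captures the day tag
ENTRY_RE = re.compile(r'\d{1,2}[A-Za-z]{3}\d{4}\((mon|tues|wed|thur|fri|sat|sun)\)')


def count_day_types(days):
    """Hypothetical helper: return (weekdays, weekend_days) for a list of entries."""
    counts = Counter()
    for entry in days:
        m = ENTRY_RE.fullmatch(entry.strip())
        if m is None:
            raise ValueError("unexpected date format: %r" % entry)
        kind = "weekend" if m.group(1) in ("sat", "sun") else "weekday"
        counts[kind] += 1
    return counts["weekday"], counts["weekend"]


days = [' 16Mar2009(mon)', ' 17Mar2009(tues)', ' 18Mar2009(wed)',
        ' 18Mar2009(sat)', ' 18Mar2009(sun)']
print(count_day_types(days))  # (3, 2)
```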

Extract Numbers from Text File excluding Dates

I have simple code which extracts numbers from a text file. It looks like this:
import re
codata = []
with open(r"filename.txt") as file:
    for line in file:
        for i in re.findall(r'\d+(?:\.\d+)?', line):
            codata.append(i)
The text contains a lot of financial data and also a lot of dates, which I don't want. Is there an easy way to modify the code to exclude dates? The dates generally follow these formats (I'm using a specific date as an example of each format, but it can be any date):
August 31, 2018
8/31/2018
8/31/18
August 2018
FY2018
CY2018
fiscal year 2018
calendar year 2018
Here is an example. I have a text file with the following text:
"For purposes of the financial analyses described in this section, the term “implied merger consideration” refers to the implied value of the per share consideration provided for in the transaction of $80.38 consisting of the cash portion of the consideration of $20.25 and the implied value of the stock portion of the consideration of 0.275 shares of XXX common stock based on XXX’s closing stock price of $218.67 per share on July 14, 2018."
When I run my code I posted above, I get this output from print(codata):
['80.38', '20.25', '0.275', '218.67', '14', '2018']
I would like to get this output instead:
['80.38', '20.25', '0.275', '218.67']
So I don't want to pick up the numbers 14 and 2018 associated with the date "July 14, 2018". If I know that any numbers related to dates within the text would have the formats that I outlined above, how should I modify my code to get the desired output?
It's hard to tell exactly what you want, but if you are just looking for whole numbers you can do this (if a value has a decimal part, use float instead of int):
import re
codata = []
with open(r"filename.txt") as file:
    for line in file:
        for i in re.findall(r'\d+(?:\.\d+)?', line):
            try:
                codata.append(int(i))
            except ValueError:  # i has a decimal part, e.g. '80.38'
                continue
Here's a regex that will match and remove your current set of dates:
import re
p = r"(((january|february|march|april|may|june|july|august|september|october|november|december) +[\d, ]+)|" + \
    r"((\d?\d\/){2}(\d\d){1,2})|" + r"((fiscal year|fy|calendar year|cy) *(\d\d){1,2}))"
codata = []
with open(r"filename.txt") as file:
    for line in file:
        codata.append(re.sub(p, "", line, flags=re.IGNORECASE))
print(codata)
Output (assuming input file is the same as your provided date list):
['\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n']
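Putting the two steps together (strip the dates first, then extract the remaining numbers) might look like the sketch below. The date pattern here is a tightened variant rather than the exact regex above: it adds \b so that month names and the fy/cy prefixes only match at the start of a word, and it limits the month form to an optional day plus a year so that a number following the date is not swallowed. The function name and the shortened sample sentence are made up for illustration.

```python
import re

# Assumed date formats: "August 31, 2018", "August 2018", "8/31/2018",
# "8/31/18", "FY2018", "CY2018", "fiscal year 2018", "calendar year 2018"
DATE_PATTERN = re.compile(
    r"\b(?:january|february|march|april|may|june|july|august|september|"
    r"october|november|december) +(?:\d{1,2}, *)?\d{4}\b"
    r"|\b\d{1,2}/\d{1,2}/(?:\d{4}|\d{2})\b"
    r"|\b(?:fiscal year|fy|calendar year|cy) *(?:\d{4}|\d{2})\b",
    re.IGNORECASE,
)
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def extract_non_date_numbers(text):
    """Hypothetical helper: blank out dates, then collect the numbers that remain."""
    without_dates = DATE_PATTERN.sub(" ", text)
    return NUMBER_PATTERN.findall(without_dates)


sample = ("the cash portion of the consideration of $20.25 and the implied value "
          "of 0.275 shares based on a closing stock price of $218.67 per share "
          "on July 14, 2018.")
print(extract_non_date_numbers(sample))  # ['20.25', '0.275', '218.67']
```

Any date format not listed in the question would still leak its digits into the result, so the pattern would need extending if the text contains other styles.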
Judging by the text sample, I assume that every price starts with a $ sign. In that case you are probably looking for the following regex:
r"(?<=\$)\d+\.?\d*(?= )"
the result would be:
['80.38', '20.25', '218.67']
Or in case you want the $ sign in your list the regex would be:
r"\$\d+\.?\d*(?= )"
and the result in that case:
['$80.38', '$20.25', '$218.67']
To clarify, (?<=\$) means that the match must be preceded by the $ sign, but the $ sign itself is not included in the output. (?= ) means that the price must be followed by a space.
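A small sketch of the lookbehind in action, on a shortened version of the sample sentence. It drops the trailing (?= ) lookahead, which is a deliberate change: with that lookahead, a price directly followed by punctuation (for example at the end of a sentence) would be skipped. Note that this approach only finds values written with a $ sign, so the share ratio 0.275 is not picked up.

```python
import re

text = ("consideration of $80.38 consisting of the cash portion of $20.25 and "
        "0.275 shares based on a closing price of $218.67 per share on July 14, 2018.")

# (?<=\$) requires a "$" immediately before the number without consuming it
PRICE_PATTERN = re.compile(r"(?<=\$)\d+(?:\.\d+)?")

prices = PRICE_PATTERN.findall(text)
print(prices)  # ['80.38', '20.25', '218.67']

# A price followed directly by a full stop is still found
print(PRICE_PATTERN.findall("It cost $5."))  # ['5']
```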

Python Regex - using re.sub to clean up a string

I am having some problems using regex sub to remove numbers from strings. Input strings can look like:
"The Term' means 125 years commencing on and including 01 October 2015."
"125 years commencing on 25th December 1996"
"the term of 999 years from the 1st January 2011"
What I want to do is remove the number and the word 'years' - I am also parsing the string for dates using DateFinder, but DateFinder interprets the number as a date - hence why I want to remove the number.
Any thoughts on the regex expression to remove the number and the word 'years'?
I think this does what you want:
import re
my_list = ["The Term' means 125 years commencing on and including 01 October 2015.",
           "125 years commencing on 25th December 1996",
           "the term of 999 years from the 1st January 2011",
           ]
for item in my_list:
    new_item = re.sub(r"\d+\syears", "", item)
    print(new_item)
results:
The Term' means commencing on and including 01 October 2015.
commencing on 25th December 1996
the term of from the 1st January 2011
Note that you will end up with some extra whitespace (maybe you want that?). You could also add this to clean it up:
new_item = re.sub(r"\s+", " ", new_item)
To trim leading and trailing whitespace you could use another regex, because I love regexes: new_item = re.sub(r"^\s+|\s+$", "", new_item)
or simply:
new_item = new_item.strip()
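The removal and the whitespace cleanup can be folded into one small function, sketched here. The name drop_year_counts is hypothetical, and accepting the singular "year" as well is an added assumption.

```python
import re


def drop_year_counts(text):
    """Hypothetical helper: remove '<number> years' and tidy the whitespace."""
    # Remove the number, the word "year(s)", and any whitespace in front of them
    stripped = re.sub(r"\s*\b\d+\s+years?\b", "", text)
    # Collapse any remaining runs of whitespace and trim both ends
    return re.sub(r"\s+", " ", stripped).strip()


print(drop_year_counts("The Term' means 125 years commencing on and including 01 October 2015."))
# The Term' means commencing on and including 01 October 2015.
print(drop_year_counts("125 years commencing on 25th December 1996"))
# commencing on 25th December 1996
```

Because the pattern requires the word "years" right after the number, the digits inside the dates are left alone.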
try this to remove numbers and word years:
re.sub(r'\s+\d+|\s+years', '', text)
if for instance:
text="The Term' means 125 years commencing on and including 01 October 2015."
then the output will be:
"The Term' means commencing on and including October."

Python Regex - Different Results in findall and sub

I am trying to replace occurrences of the word 'brunch' with 'BRUNCH'. I am using a regex which correctly identifies the occurrence, but when I try to use re.sub it replaces more text than re.findall identifies. The regex that I am using is:
re.compile(r'(?:^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)[^.]*(brunch)',re.IGNORECASE)
The string is
str = 'Valid only for dine-in January 2 - March 31, 2015. Excludes brunch, happy hour, holidays, and February 13 - 15, 2015.'
I want it to produce:
'Valid only for dine-in January 2 - March 31, 2015. Excludes BRUNCH, happy hour, holidays, and February 13 - 15, 2015.'
The steps:
>>> reg.findall(str)
['brunch']
>>> reg.sub('BRUNCH',str)
'Valid only for dine-in January 2 - March 31, 2015BRUNCH, happy hour, holidays, and February 13 - 15, 2015.'
Edit:
The final solution that I used was:
reg = re.compile(r'((?:^|\.))(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)([^.]*)(brunch)',re.IGNORECASE)
reg.sub(r'\g<1>\g<2>BRUNCH',str)
For re.sub, use
(^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)([^.]*)(brunch)
Replace with \1\2BRUNCH. See the demo:
https://regex101.com/r/eZ0yP4/16
Using this regex:
(^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)([^.]*)brunch
Replace the matched characters with \1\2BRUNCH.
Why does it match more than brunch?
Because your regex actually does match more than brunch: the full match also includes the sentence separator and the text leading up to the word (here, '. Excludes brunch').
Why doesn't it show in findall?
Because you have wrapped only brunch in parentheses:
>>> reg = re.compile(r'(?:^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)[^.]*(brunch)',re.IGNORECASE)
>>> reg.findall(str)
['brunch']
After wrapping the entire ([^.]*brunch) in parentheses:
>>> reg = re.compile(r'(?:^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)([^.]*brunch)',re.IGNORECASE)
>>> reg.findall(str)
[' Excludes brunch']
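When the pattern contains capturing groups, re.findall returns only the captured groups and ignores the parts of the match that are not captured.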
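The difference can be seen side by side in the short sketch below: findall reports only the captured group, group(0) of the match object shows everything that sub would replace, and capturing the leading text lets the replacement put it back. The variable names are illustrative, and text is used instead of str to avoid shadowing the built-in.

```python
import re

text = ('Valid only for dine-in January 2 - March 31, 2015. '
        'Excludes brunch, happy hour, holidays, and February 13 - 15, 2015.')

# Pattern from the question: only "brunch" is captured
original = re.compile(
    r'(?:^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)[^.]*(brunch)',
    re.IGNORECASE)

print(original.findall(text))          # ['brunch']
print(original.search(text).group(0))  # . Excludes brunch

# Capture everything before "brunch" so the replacement can restore it
fixed = re.compile(
    r'((?:^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)[^.]*)brunch',
    re.IGNORECASE)

result = fixed.sub(r'\g<1>BRUNCH', text)
print(result)
```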
