Does anyone know a cleaner way to write this regex? - python

(?:reminder|Reminder)\s\d+\s\b(?:second|seconds|Second|Seconds|minute|minutes|Minute|Minutes|hour|hours|Hour|Hours|day|days|Day|Days|week|weeks|Week|Weeks|month|months|Month|Months|year|years|Year|Years)\b
Objective format: "Reminder 3 seconds", "Reminder 20 days", "Reminder 3 second" etc

[rR]eminder\s\d+\s(?:[sS]econd|[mM]inute|[hH]our|[dD]ay|[wW]eek|[mM]onth|[yY]ear)s?\b
I think this works. Most of the changes I made were putting alternative characters into character classes such as [rR]. A little bit of it was moving the optional s outside the group. Does this make sense to you?
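One way to gain some confidence that the shortened pattern behaves like the original is to run both over a handful of strings and compare what they match. This is only a sketch: the sample list and the helper name matched_text are made up for illustration, and a few samples do not prove the two patterns are equivalent.

```python
import re

# The original pattern from the question, split across lines for readability.
original = re.compile(
    r"(?:reminder|Reminder)\s\d+\s\b(?:second|seconds|Second|Seconds|minute|minutes|"
    r"Minute|Minutes|hour|hours|Hour|Hours|day|days|Day|Days|week|weeks|Week|Weeks|"
    r"month|months|Month|Months|year|years|Year|Years)\b"
)
# The shortened pattern from the answer.
simplified = re.compile(
    r"[rR]eminder\s\d+\s(?:[sS]econd|[mM]inute|[hH]our|[dD]ay|[wW]eek|[mM]onth|[yY]ear)s?\b"
)

# Hypothetical samples: the last three should be rejected by both patterns.
samples = [
    "Reminder 3 seconds",
    "reminder 20 Days",
    "Reminder 3 second",
    "Reminder 3 secondss",
    "REMINDER 3 days",
    "Reminder 3 WEek",
]


def matched_text(pattern, text):
    """Return the matched substring, or None when the pattern does not match."""
    m = pattern.search(text)
    return m.group(0) if m else None


results = [(s, matched_text(original, s), matched_text(simplified, s)) for s in samples]
for sample, old, new in results:
    print(f"{sample!r}: original={old!r} simplified={new!r}")
```

Neither pattern accepts mixed case such as "WEek" or "REMINDER"; a case-insensitive flag is needed for that.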

I'm guessing that maybe fewer boundaries might be OK here,
(?i)\breminder\s+\d+\s+\b(?:seconds?|minutes?|hours?|days?|weeks?|months?|years?)\b
or maybe not.
Demo
Test
import re
expression = r"(?i)\breminder\s+\d+\s+\b(?:seconds?|minutes?|hours?|days?|weeks?|months?|years?)\b"
string = """
Reminder 3 seconds some data here, Reminder 20 days and some more data, Reminder 3 second and Reminder 3 WEek
"""
print(re.findall(expression, string))
Output
['Reminder 3 seconds', 'Reminder 20 days', 'Reminder 3 second', 'Reminder 3 WEek']

Related

Extract date from a string with a lot of numbers

There seems to be quite a few ways to extract datetimes in various formats from a string. But there seems to be an issue when the string contains many numbers and symbols.
Here is an example:
t = 'Annual Transmission Revenue Requirements and Rates Transmission Owner (Transmission Zone) Annual Transmission Revenue Requirement Network Integration Transmission Service Rate ($/MW-Year) AE (AECO) $136,632,319 $53,775 AEP (AEP) $1,295,660,732 $59,818.14 AP (APS) $128,000,000 $17,895 ATSI (ATSI) $659,094,666 $54,689.39 BC (BGE) $230,595,535 $35,762 ComEd, Rochelle (CE) $702,431,433 $34,515.60 Dayton (DAY) $40,100,000 $13,295.76 Duke (DEOK) $121,250,903 $24,077 Duquesne (DLCO) $139,341,808 $51,954.44 Dominion (DOM) $1,031,382,000 $52,457.21 DPL, ODEC (DPL) $163,224,128 $42,812 East Kentucky Power Cooperative (EKPC) $83,267,903 $24,441 MAIT (METED, PENELEC) $150,858,703 $26,069.39 JCPL $135,000,000 $23,597.27 PE (PECO) $155,439,100 $19,093 PPL, AECoop, UGI (PPL) $435,349,329 $58,865 PEPCO, SMECO (PEPCO) $190,876,083 $31,304.21 PS (PSEG) $1,248,819,352 $130,535.22 Rockland (RECO) $17,724,263 $44,799 TrAILCo $226,652,117.80 n/a Effective June 1, 2018 '
import datefinder
m = datefinder.find_dates(t)
for match in m:
    print(match)
Is there a way to smoothly extract the date? I can resort to re for specific formats if no better way exists. From github of datefinder it seems that it was abandoned a year ago.
Although I don't know exactly how your dates are formatted, here's a regex solution that will work with dates separated by '/'. It should work whether the months and days are written as a single digit or with a leading zero.
If your dates are separated by hyphens instead, replace the 9th and 18th character of the regex with a hyphen instead of /. (If using the second print statement, replace the 12th and 31st character)
Edit: Added the second print statement with some better regex. That's probably the better way to go.
import re
mystring = r'joasidj9238nlsd93901/01/2021oijweo8939n'
print(re.findall(r'\d{1,2}\/\d{1,2}\/\d{2,4}', mystring)) # This would probably work in most cases
print(re.findall(r'[0-1]{0,2}\/[0-3]{0,1}\d{0,1}\/\d{2,4}', mystring)) # This one is probably a better solution. (More protection against weirdness.)
Edit #2: Here's a way to do it with the month name spelled out (in full, or 3-character abbreviation), followed by day, followed by comma, followed by a 2 or 4 digit year.
import re
mystring = r'Jan 1, 2020'
print(re.findall(r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s+\d{1,2}\,\s+\d{2,4}',mystring))
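To go from the matched text to an actual date object, the month-name pattern can be combined with datetime.strptime. The following is a rough sketch, not a general date parser: the function name extract_dates is made up, it only handles 4-digit years, it assumes English month names (strptime's %B and %b depend on the locale), and the sample is just the tail of the string from the question.

```python
import re
from datetime import datetime

# Month name in full or as a 3-letter abbreviation.
MONTH = (
    r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|"
    r"Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)"
)
# Month, day, comma, 4-digit year; \b stops the month from matching inside a longer word.
DATE_RE = re.compile(r"\b" + MONTH + r"\s+\d{1,2},\s+\d{4}\b")


def extract_dates(text):
    """Return a datetime for every 'Month D, YYYY' occurrence in text."""
    dates = []
    for raw in DATE_RE.findall(text):
        # Try the full month name first, then the abbreviation.
        for fmt in ("%B %d, %Y", "%b %d, %Y"):
            try:
                dates.append(datetime.strptime(raw, fmt))
                break
            except ValueError:
                continue
    return dates


sample = "TrAILCo $226,652,117.80 n/a Effective June 1, 2018 "
print(extract_dates(sample))
```

Because the pattern requires a month name, the dollar amounts in the string are never considered as candidate dates.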

Removing rows from a DataFrame based on words in a string

Novice programmer here seeking help.
I have a Dataframe that looks like this:
Current
0 "Invest in $APPL, $FB and $AMZN"
1 "Long $AAPL, Short $AMZN"
2 "$AAPL earnings announcement soon"
3 "$FB is releasing a new product. Will $FB's product be good?"
4 "$Fb doing good today"
5 "$AMZN high today. Will $amzn continue like this?"
I also have a list with all the cashtags: cashtags = ["$AAPL", "$FB", "$AMZN"]
Basically, I want to go through all the lines in this column of the DataFrame and keep the rows that mention exactly one distinct cashtag, regardless of whether it is in caps, and delete all others.
Desired Output:
Desired
2 "$AAPL earnings announcement soon"
3 "$FB is releasing a new product. Will $FB's product be good?"
4 "$Fb doing good today"
5 "$AMZN high today. Will $amzn continue like this?"
I've tried to basically count how many times the word appears in the string and add that value to a new column so that I can delete the rows based on the number.
for i in range(0,len(df)-1):
    print(i, end = "\r")
    tweet = df["Current"][i]
    count = 0
    for word in cashtags:
        count += str(tweet).count(word)
    df["Word_count"][i] = count
However if I do this I will be deleting rows that I don't want to. For example, rows where the unique cashtag is mentioned several times ([3],[5])
How can I achieve my desired output?
Rather than summing the count of each cashtag, you should sum its presence or absence, since you don't care how many times each cashtag occurs, only how many cashtags.
for tag in cashtags:
    count += tag in tweet
Or more succinctly: sum(tag in tweet for tag in cashtags)
To make the comparison case insensitive, you can upper case the tweets beforehand. Additionally, it would be more idiomatic to filter on a temporary series and avoid explicitly looping over the dataframe (though you may need to read up more about Pandas to understand how this works):
df[df.Current.apply(lambda tweet: sum(tag in tweet.upper() for tag in cashtags)) == 1]
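The same presence-counting idea can be tried without pandas on the sample rows from the question. This is a minimal sketch; the function name distinct_cashtags is made up for illustration. Note that `tag in text` is a substring test, so "$FB" would also be found inside a longer tag such as "$FBX".

```python
cashtags = ["$AAPL", "$FB", "$AMZN"]

tweets = [
    "Invest in $APPL, $FB and $AMZN",
    "Long $AAPL, Short $AMZN",
    "$AAPL earnings announcement soon",
    "$FB is releasing a new product. Will $FB's product be good?",
    "$Fb doing good today",
    "$AMZN high today. Will $amzn continue like this?",
]


def distinct_cashtags(tweet, tags=cashtags):
    """Count how many of the known tags occur at least once, ignoring case."""
    upper = tweet.upper()
    # Each `tag in upper` is True/False, which sums as 1/0.
    return sum(tag in upper for tag in tags)


counts = [distinct_cashtags(t) for t in tweets]
kept = [t for t, n in zip(tweets, counts) if n == 1]
for t in kept:
    print(t)
```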
If you ever want to generalise your question to any tag, then this is a good place for a regular expression.
You want to match against (\$\w+)(?!.*\1) (see e.g. here for a detailed explanation), but the general structure is:
\$\w+: find a dollar sign followed by one or more letters/numbers (or an _). If you just wanted to count how many tags you had, this is all you need
e.g.
df.Current.str.count(r'\$\w+')
will print
0 3
1 2
2 1
3 2
4 1
5 2
but this counts the same tag more than once when it is repeated in a row, so you need to add a negative lookahead:
(?!.*\1): a negative lookahead with a backreference, which means don't match if the same text appears again later in the string. As a result, only the last occurrence of each tag is counted.
Using this, you can then use the pandas Series.str methods, specifically Series.str.count (the re.I does a case-insensitive match)
import re
df[df.Current.str.count(r'(\$\w+)(?!.*\1)', re.I) == 1]
which will give you your desired output
Current
2 $AAPL earnings announcement soon
3 $FB is releasing a new product. Will $FB's pro...
4 $Fb doing good today
5 $AMZN high today. Will $amzn continue like this?
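To see what the lookahead pattern counts on each row without involving pandas, it can be run directly with the re module. This is only a sketch on the sample rows from the question; the name UNIQUE_TAG is made up for illustration.

```python
import re

# A tag is only matched if the same text does not occur again later in the string.
UNIQUE_TAG = re.compile(r"(\$\w+)(?!.*\1)", re.I)

tweets = [
    "Invest in $APPL, $FB and $AMZN",
    "Long $AAPL, Short $AMZN",
    "$AAPL earnings announcement soon",
    "$FB is releasing a new product. Will $FB's product be good?",
    "$Fb doing good today",
    "$AMZN high today. Will $amzn continue like this?",
]

# Number of distinct tags per tweet (only the last occurrence of each tag matches).
counts = [len(UNIQUE_TAG.findall(t)) for t in tweets]
kept = [t for t, n in zip(tweets, counts) if n == 1]

print(counts)
for t in kept:
    print(t)
```

Unlike the plain \$\w+ count, rows 3 and 5 now count as 1, because the repeated tag is only matched at its last occurrence.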

How to write a regex in python to recognize days inside a string

In this assignment, the input wanted is in this format:
Regular: 16Mar2009(mon), 17Mar2009(tues), 18Mar2009(wed) ...
Reward: 26Mar2009(thur), 27Mar2009(fri), 28Mar2009(sat)
Regular or Reward is the name of customer type. I separated this string like this.
entry_list = input.split(":") #input is a variable
client = entry_list[0] # only Regular or Reward
dates = entry_list[1] # only dates
days = dates.split(",")
But now I need to count weekdays or weekend days inside the days list:
days = [' 16Mar2009(mon)', ' 17Mar2009(tues)', ' 18Mar2009(wed)']
When it is mon tues wed thur fri, all count as weekday, and I need to know how many weekdays the input have.
When it is sat sun must be counted as weekend days, and I need to know how many weekends the input have.
How to write a regex in python to search for all weekdays and weekend days inside this list and count them, putting the number of weekdays and weekend days in two different counters?
EDIT
I wrote this function to check if the dates are in the right format but it's not working:
def is_date_valid(date):
    date_regex = re.compile(r'(?:\d{1,2}[A-Za-z]{3}\d{4}\([A-Za-z]{3}\),\s+){2}\d{1,2}[A-Za-z]{3}\d{4}\([A-Za-z]{3}\)$')
    m = date_regex.search(date)
m is only returning None
You don't really need a regex for this. You can just look for "sat" and "sun" tags directly, since your days are formatted the same way (i.e. no capitals, no "tue" instead of "tues", etc.) you shouldn't need to generalize to a pattern. Just loop through the list and look for "sat" and "sun":
import re #if you are using the re
days = [' 16Mar2009(mon)', ' 17Mar2009(tues)', ' 18Mar2009(wed)', ' 18Mar2009(sat)', ' 18Mar2009(sun)']
weekends = 0
weekdays = 0
for day in days:
    if "sat" in day or "sun" in day: #if re.search( '(sat|sun)', day ): also works
        weekends = weekends+1
    else:
        weekdays = weekdays+1
print(weekends)
print(weekdays)
>>>2
>>>3
If you need to use a regex, because this is part of an assignment for example, then this variation of the if statement will do it: if re.search( '(sat|sun)', day ): This isn't much more useful than just using the strings, since the strings are the regex in this case. Still, seeing how to combine multiple patterns into one regex with or-style logic is useful, so I'm including it here.
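If a regex is required, one option is to pull the day tags straight out of the whole input line and count them, without splitting on ":" and "," first. This is a sketch under the assumption that the tags are always lowercase and spelled exactly as in the question (mon, tues, wed, thur, fri, sat, sun); the names DAY_TAG and count_days are made up for illustration.

```python
import re

# Capture the day abbreviation between the parentheses, e.g. "(thur)" -> "thur".
DAY_TAG = re.compile(r"\((mon|tues|wed|thur|fri|sat|sun)\)")
WEEKEND = {"sat", "sun"}


def count_days(entry):
    """Return (weekdays, weekends) for one input line."""
    tags = DAY_TAG.findall(entry)
    weekends = sum(tag in WEEKEND for tag in tags)
    return len(tags) - weekends, weekends


entry = "Reward: 26Mar2009(thur), 27Mar2009(fri), 28Mar2009(sat)"
weekdays, weekends = count_days(entry)
print(weekdays, weekends)
```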

Python: fastest way to re.findall twice?

I like regular expressions. I often find myself using multiple regex statements to narrow in on the value I need when trying to get a substring from a large block of text.
So far, my approach has been the following:
Use resultOfRegex1 = re.findall(firstRegex, myString) for my first regex
Check to see that resultOfRegex1[0] exists
Use resultOfRegex2 = re.findall(secondRegex, resultOfRegex1[0]) for my second regex
Check to see that resultOfRegex2[0] exists, and print that value
But I feel like this is much more verbose and costly than it has to be. Is there an easier/faster way to match one regex and then match another regex based on the result of the first?
The whole point of groups is to allow extraction of subgroups from an overall match.
For example, instead of two searches done in the following fashion:
>>> import re
>>> s = 'The winning team scored 15 points and used only 2 timeouts'
>>> score_clause = re.search(r'scored \d+ point', s).group(0)
>>> re.search(r'\d+', score_clause).group(0)
'15'
Do a single search with a sub-group:
>>> re.search(r'scored (\d+) point', s).group(1)
'15'
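The "check that the result exists" steps from the question collapse into a single test, because re.search returns None when nothing matches. A minimal sketch, with extract_score as a made-up helper name:

```python
import re


def extract_score(text):
    """Return the score as an int, or None when the text has no score clause."""
    match = re.search(r"scored (\d+) point", text)
    # re.search returns None on failure, so guard before calling .group().
    return int(match.group(1)) if match else None


s = 'The winning team scored 15 points and used only 2 timeouts'
print(extract_score(s))
print(extract_score('The game was postponed'))
```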
One other thought: if you want to make decisions about whether to continue a findall-style search based on the first match, a reasonable choice would be to use re.finditer and extract values as needed:
>>> game_results = '''\
10 point victory: 1 in first period, 6 in second period, 3 in third period.
5 point victory: 0 in first period, 5 in second period, 0 in third period.
12 point victory: 5 in first period, 3 in second period, 4 in third period.
7 point victory: 3 in first period, 0 in second period, 4 in third period.
'''.splitlines()
>>> # Show period-by-period scores for games won by 8 or more points
>>> for game_result in game_results:
...     it = re.finditer(r'\d+', game_result)
...     if int(next(it).group(0)) >= 8:
...         print('Big win:', [int(mo.group(0)) for mo in it])
...
Big win: [1, 6, 3]
Big win: [5, 3, 4]

Python : Regex capturing generic for 3 cases.

Hi, can anyone help me improve my regular expression, which is not working?
Strings Cases:
1) 120 lbs and is intended for riders ages 8 years and up. #catch : 8 years and up
2) 56w x 28d x 32h inches recommended for hobbyists recommended for ages 12 and up. #catch : 12 and up
3) 4 users recorded speech for effective use language tutor pod measures 11l x 9w x 5h inches recommended for ages 6 and above. #catch : 6 and above
I want a generic regular expression which works for all three strings.
My regular expression is :
\b\d+[\w+\s]?(?:\ban[a-z]\sup\b|\ban[a-z]\sabove\b|\ban[a-z]\sold[a-z]*\b|\b&\sup)
But it is not working well. Can anyone provide a generic regular expression which works for all 3 cases? I am using Python's re.findall().
Make it a habit and start with verbose regular expressions:
import re
rx = re.compile(r'''
    ages\                                  # look for "ages" followed by a space
    (\d+(?:\ years)?\ and\ (?:above|up))   # capture the number, an optional " years",
                                           # and one of "above" or "up"
    ''', re.VERBOSE)
string = '''
1) 120 lbs and is intended for riders ages 8 years and up. #catch : 8 years and up
2) 56w x 28d x 32h inches recommended for hobbyists recommended for ages 12 and up. #catch : 12 and up
3) 4 users recorded speech for effective use language tutor pod measures 11l x 9w x 5h inches recommended for ages 6 and above. #catch : 6 and above
'''
matches = rx.findall(string)
print(matches)
# ['8 years and up', '12 and up', '6 and above']
See a demo on ideone.com as well as on regex101.com.
(As the suggestion I made in a comment appears to have been what you wanted, I offer it as an answer.)
If your examples illustrate all possible strings (but I fear they don't ;) you could do it as simple as
\d+[^\d]*$
See it here at regex101.
It matches the last number, and everything after it.
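A quick sketch of how this behaves on the three example strings, applied one line at a time and with the "#catch" annotations left out. The name LAST_NUMBER is made up for illustration. Since the pattern runs to the end of the string, the trailing period is part of each match.

```python
import re

# Last run of digits in the string, plus everything after it.
LAST_NUMBER = re.compile(r"\d+[^\d]*$")

lines = [
    "120 lbs and is intended for riders ages 8 years and up.",
    "56w x 28d x 32h inches recommended for hobbyists recommended for ages 12 and up.",
    "4 users recorded speech for effective use language tutor pod measures "
    "11l x 9w x 5h inches recommended for ages 6 and above.",
]

matches = [LAST_NUMBER.search(line).group(0) for line in lines]
print(matches)
```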
Or a little bit more sophisticated - making sure it's preceded by age - here
