Date regex produces wrong output - python

I need to find files whose names match a date format in Python. Could someone please help me with this? I have a regex, but it's not working as required.
import re
date = '2012-01-15'
match = re.findall(r'^(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$', date)
print(match)
Output:
[('20', '01', '15')]

It seems you just missed a pair of parentheses around the complete year match, and you probably want to make the century group non-capturing with (?:...):
match = re.findall(r'^((?:19|20)\d\d)[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$', date)
#                     ^ ^^          ^
This gives [('2012', '01', '15')] on your example.
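As a quick check of the corrected pattern, here is a minimal sketch. The helper name parse_date is made up for illustration. re.findall returns only the capturing groups of each match, which is why the century alternation needs to be non-capturing.

```python
import re

# Same pattern as in the answer: the whole year is one capturing group,
# and the century alternation (?:19|20) is non-capturing.
DATE_RE = re.compile(
    r'^((?:19|20)\d\d)[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$'
)

def parse_date(text):
    """Return (year, month, day) as strings, or None if text does not match."""
    m = DATE_RE.match(text)
    return m.groups() if m else None

print(parse_date('2012-01-15'))  # ('2012', '01', '15')
print(parse_date('2012-13-15'))  # None, because month 13 is rejected
```

Note that the pattern only checks the shape of the string; a value such as 2012-02-31 still matches, so a real calendar check would need something like datetime.strptime.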

Related

Extract date from file name with import re in python

My file name looks like this:
show_data_paris_12112019.xlsx
I want to extract the date only and I have tried this script:
date = os.path.basename(xls)
pattern = r'(?<=_)_*(?=\.xlsx)'
re.search(pattern, date).group(0)
event_date = re.search(pattern, date).group(0)
event_date_obj = datetime.strptime (event_date, '%Y%m%d')
but it gives me errors. How can I fix this?
Thank you.
It looks to me like the regex you're using is also at fault: it matches nothing, so re.search returns None and the call to group(0) fails.
Assuming all your dates are stored as digits, the following regex seems to work quite well:
(?!.+_)\d+(?=\.xlsx)
The next issue is the format string passed to strptime. To me, 12112019 looks like 12/11/2019 (day, month, year), although it could also be month, day, year. Either way, '%Y%m%d' does not fit, so we need to change the format that strptime expects.
So for the day / month / year format we would use
# %d%m%Y
event_date_obj = datetime.strptime(event_date, '%d%m%Y')
And we would simply swap %d and %m for the month / day / year format. So your complete code would look something like this:
import os
import re
from datetime import datetime
# xls is the path to the Excel file
date = os.path.basename(xls)
pattern = r"(?!.+_)\d+(?=\.xlsx)"
event_date = re.search(pattern, date).group(0)
event_date_obj = datetime.strptime(event_date, '%d%m%Y')
For further information on the format codes shared by strftime and strptime, see https://strftime.org/.
_* matches a sequence of zero or more underscore characters.
(?<=_) means that it has to be preceded by an underscore.
(?=\.xlsx) means that it has to be followed by .xlsx.
So this will match the ___ in foo____.xlsx. But it doesn't match your filename, because the date digits sit between the underscore and .xlsx.
You should match \d+ rather than _* between the lookbehind and lookahead.
pattern = r'(?<=_)\d+(?=\.xlsx)'
And if the data is always 8 digits, use \d{8} to be more specific.
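Putting the two fixes together, a minimal sketch might look like this. The function name extract_date is hypothetical, and the '%d%m%Y' format assumes the file names use day, month, year order.

```python
import re
from datetime import datetime

def extract_date(filename):
    """Return the date stored as 8 digits before '.xlsx', or None if absent."""
    # Lookbehind for the underscore, exactly 8 digits, lookahead for the extension.
    m = re.search(r'(?<=_)\d{8}(?=\.xlsx)', filename)
    if m is None:
        return None
    # Assumption: digits are DDMMYYYY; use '%m%d%Y' for month-first names.
    return datetime.strptime(m.group(0), '%d%m%Y')

d = extract_date('show_data_paris_12112019.xlsx')
print(d.date())                      # 2019-11-12
print(extract_date('foo____.xlsx'))  # None
```

Checking for None before calling group(0) avoids the AttributeError that the original code runs into when nothing matches.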

pandas column modification with regular expression

I want to fix some string entries in a pandas Series so that all values matching the pattern '0x.202' (the last digit of the year is missing) get a zero appended at the end (giving the full date in 'mm.yyyy' format). Here is the pattern I have:
pattern = '\d*\.202(?:$|\W)'
It matches a run of digits, then a dot, then exactly 202 at the end. Could you please help me replace the strings in the Series while preserving the original indexes?
My current way to do this is:
date = df['Calendar Year/Month'].astype('str')
pattern = re.compile('\d*\.202(?:$|\W)')
date.str.replace(pattern, pattern.pattern + '0', regex=True)
but I get an error:
error: bad escape \d at position 0
Edit: Sorry for the lack of detail. I forgot to mention that the dates were misinterpreted by pandas as floats, which is why dates in 2020 were not shown completely (5.2020 becomes 5.202, for example). This is the expression I ended up using:
date = df['Year/Month'].astype('str')
date = date.apply(lambda _: _ if _[-1] == '1' or _[-1] == '9' else f'{_}0')
So only 'xx.202' values are edited, and dates like 'xx.2021' and 'xx.2019' are left alone. Thanks everyone for the help!
Do you have to use regex here? If not, this would work (add a 0 unless the string already has 7 characters, which assumes two-digit months):
df["Calendar Year/Month"].apply(lambda _: _ if len(_)==7 else f'{_}0')
Or maybe this (add a 0 if the last digit is 2):
df["Calendar Year/Month"].apply(lambda _: f'{_}0' if _[-1] == '2' else _)
I would do a str.replace:
df = pd.DataFrame({'Year/Month':['10.202 abc', 'abc 1.202']})
df['Year/Month'].str.replace(r'(\d*\.202)\b', r'\g<1>0', regex=True)
Output:
0 10.2020 abc
1 abc 1.2020
Name: Year/Month, dtype: object
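The "bad escape \d" error in the question comes from passing the pattern text as the replacement: the replacement is parsed as a template, and \d is not a valid escape there. The sketch below reproduces this with the plain re module (no pandas, and the sample values are made up), on the assumption that the pandas call passes the replacement through to re in the same way.

```python
import re

pattern = re.compile(r'(\d*\.202)\b')
values = ['10.202', '5.202', '11.2021', '3.2019']  # hypothetical sample data

# Using the pattern text as the replacement fails, because "\d" is not
# a valid escape inside a replacement template (Python 3.7+).
try:
    pattern.sub(pattern.pattern + '0', '10.202')
except re.error as exc:
    print('re.error:', exc)

# Referring back to the captured group works as intended.
fixed = [pattern.sub(r'\g<1>0', v) for v in values]
print(fixed)  # ['10.2020', '5.2020', '11.2021', '3.2019']
```

The \b after 202 is what keeps 11.2021 and 3.2019 unchanged: there is no word boundary between 202 and the following digit.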

regex python + variable

I hope you can give me a hand with this.
I'm trying to find a match on a variable's value.
net_card is a string:
net_card = salida.read()
regex = re.compile('([a-z])\w+' % re.escape(net_card))
If I run this code, it shows me this error:
regex = re.compile('([a-z])\w+' % re.escape(net_card))
TypeError: not all arguments converted during string formatting
I haven't found a way to solve this, even with escape characters.
Now if I do this:
net_card = salida.read()
match = re.search('([a-z])\w+', net_card)
whatIWant = match.group(1) if match else None
print whatIWant
it shows me just e in the output, even though the value of net_card is NAME=ens32.
Your regex, ([a-z])\w+, captures a single character in the range a-z as the first group and then matches the following word characters with \w+ (equivalent to [a-zA-Z0-9_]+ for ASCII text) without capturing them. Instead, match two groups of \w+ separated by an equals sign. Here's an expression:
(\w+)=(\w+)
In practice (if you don't care about "NAME"), you can remove the first group and use:
net_card = salida.read()
match = re.match(r'\w+=(\w+)', net_card)
print(match.group(1) if match else None)
Which will output ens32.
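A small sketch covering both halves of the question. The TypeError arises because the % operator was applied to a pattern with no %s placeholder. The names value_after_equals and iface are hypothetical, as is the sample text.

```python
import re

def value_after_equals(text):
    """Return the part after '=' in a 'KEY=value' string, or None."""
    m = re.match(r'\w+=(\w+)', text)
    return m.group(1) if m else None

print(value_after_equals('NAME=ens32\n'))  # ens32

# To search for a variable's literal value, escape it and insert it through
# an explicit %s placeholder (the original pattern had none).
iface = 'ens32'  # hypothetical value
regex = re.compile(r'NAME=%s\b' % re.escape(iface))
print(bool(regex.search('NAME=ens32')))   # True
print(bool(regex.search('NAME=ens320')))  # False
```

re.escape only matters when the variable may contain regex metacharacters such as a dot; for a plain interface name it leaves the text unchanged.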

Python regex similar expressions

I have a file with two different types of data I'd like to parse with a regex; however, the data is similar enough that I can't find the correct way to distinguish it.
Some lines in my file are of form:
AED=FRI
AFN=FRI:SAT
AMD=SUN:SAT
Other lines are of form
AED=20180823
AMD=20150914
AMD=20150921
The remaining lines are headers and I'd like to discard them. For example
[HEADER: BUSINESS DATE=20160831]
My solution attempt so far is to match first three capital letters and an equal sign,
r'\b[A-Z]{3}=\b'
but after that I'm not sure how to distinguish between dates (eg 20180823) and days (eg FRI:SAT:SUN).
The results I'd expect from these parsing functions:
Regex weekday_rx = new Regex(<EXPRESSION FOR TYPES LIKE AED=FRI>);
Regex date_rx = new Regex(<EXPRESSION FOR TYPES LIKE AED=20160816>);
weekdays = [weekday_rx.Match(line) for line in infile.read()]
dates = [date_rx.Match(line) for line in infile.read()]
r'\S*\d$'
This matches a run of non-whitespace characters ending in a digit at the end of the line.
It will match AED=20180823.
r'\S*[a-zA-Z]$'
This matches a run of non-whitespace characters ending in a letter at the end of the line.
It will match AED=FRI
AFN=FRI:SAT
AMD=SUN:SAT
Neither will match
[HEADER: BUSINESS DATE=20160831]
This will match both
r'(\S*[a-zA-Z]$|\S*\d$)'
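Replacing the * with the number of occurrences you expect will be safer. The (a|b) construct means match a or match b.
The following is a solution in Python :) Note that for AMD=SUN:SAT, group 2 captures only SUN, because the pattern stops at the colon.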
import re
p = re.compile(r'\b([A-Z]{3})=((\d)+|([A-Z])+)')
str_test_01 = "AMD=SUN:SAT"
m = p.search(str_test_01)
print (m.group(1))
print (m.group(2))
str_test_02 = "AMD=20150921"
m = p.search(str_test_02)
print (m.group(1))
print (m.group(2))
"""
<Output>
AMD
SUN
AMD
20150921
"""
Use pipes to express alternatives in a regex. The pattern '[A-Z]{3}:[A-Z]{3}|[A-Z]{3}' will match both ABC and ABC:ABC. Then use parentheses to group the results:
import re
match = re.match(r'([A-Z]{3}:[A-Z]{3})|([A-Z]{3})', 'ABC:ABC')
assert match.groups() == ('ABC:ABC', None)
match = re.match(r'([A-Z]{3}:[A-Z]{3})|([A-Z]{3})', 'ABC')
assert match.groups() == (None, 'ABC')
You can research the concept of named groups to make this even more readable. Also, take a look at the docs for the match object for useful info and methods.
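As an illustration of the named-group idea applied to the file format in the question, here is a minimal sketch. The names WEEKDAY_RX and DATE_RX, the group names, and the assumption that dates are always exactly 8 digits are mine, not from the original post.

```python
import re

# Three capital letters, '=', then one or more colon-separated weekday codes.
WEEKDAY_RX = re.compile(r'(?P<code>[A-Z]{3})=(?P<days>[A-Z]{3}(?::[A-Z]{3})*)')
# Three capital letters, '=', then an 8-digit date (assumed YYYYMMDD).
DATE_RX = re.compile(r'(?P<code>[A-Z]{3})=(?P<date>\d{8})')

lines = [
    '[HEADER: BUSINESS DATE=20160831]',
    'AED=FRI',
    'AFN=FRI:SAT',
    'AED=20180823',
]

weekdays, dates = [], []
for line in lines:
    line = line.strip()
    # fullmatch requires the whole line to fit, so header lines are skipped.
    m = WEEKDAY_RX.fullmatch(line)
    if m:
        weekdays.append((m.group('code'), m.group('days').split(':')))
        continue
    m = DATE_RX.fullmatch(line)
    if m:
        dates.append((m.group('code'), m.group('date')))

print(weekdays)  # [('AED', ['FRI']), ('AFN', ['FRI', 'SAT'])]
print(dates)     # [('AED', '20180823')]
```

Using fullmatch instead of search is what discards the header: DATE=20160831 appears inside it, but the line as a whole does not fit either pattern.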

Issue on Matching the string and replacing the hex in python

st = """
What kind of speCialist would we see for this?He also seems to have reactions to the red dye cochineal/carmine cialist,I like Cialist much
"""
Here I need to replace only the string Cialist (exact match); it may also have a comma at the end.
The word "speCialist" should not be replaced.
I tried this regex:
bold_string = "<b>"+"Cialist"+"</b>"
insensitive_string = re.compile(re.escape("cialist"), re.IGNORECASE)
comment = insensitive_string.sub(bold_string,st)
but it also replaces the match inside specialist.
Could you suggest how to fix this?
One more issue, with replacing hexadecimal characters in Python:
date_str = "28-06-2010\xc3\x82\xc2\xa008:48 PM"
date_str = date_str.replace("\n","").replace("\t","").replace("\r","").replace("\xc3\x82\xc2\xa"," ")
date_obj = datetime.strptime(date_str,"%d-%m-%Y %H:%M %p")
Error: time data '08-09-2005\xc3\x82\xc2\xa010:18 PM' does not match format '%d-%m-%Y %H:%M %p'
Here I am not able to replace the hex characters with a space so that the string matches the datetime pattern.
Could you please help me with this issue?
For your second Q:
>>> re.sub(r'\\[a-zA-z0-9]{2}', lambda L: str(int(L.group()[2:], 16)), text)
'28-06-20101238212210008:48 PM'
Then either reorganise that for your strptime, or have strptime interpret it.
Two questions in one?
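Add word boundaries to your regex, so it becomes re.sub(r'\bcialist\b', '', your_string, flags=re.I). Note that flags must be passed by keyword, because the fourth positional argument of re.sub is count.
Use \b to match a word boundary. Then it becomes simple :)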
import re
st = """
What kind of speCialist would we see for this?He also seems to have reactions to the red dye cochineal/carmine cialist,I like Cialist much
"""
print(re.sub(r'\bCialist\b', "<b>Cialist</b>", st, flags=re.IGNORECASE))
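A variation on the same idea, as a sketch: if the original capitalisation of each hit should be kept rather than forced to "Cialist", the replacement can refer to the matched text with \g<0>. Whether that is wanted is an assumption on my part; the question itself uses a fixed replacement string.

```python
import re

st = ("What kind of speCialist would we see for this?He also seems to have "
      "reactions to the red dye cochineal/carmine cialist,I like Cialist much")

# \b keeps "speCialist" untouched; \g<0> re-inserts the text exactly as matched.
result = re.sub(r'\bcialist\b', r'<b>\g<0></b>', st, flags=re.IGNORECASE)
print(result)
```

The trailing comma after cialist is not a word character, so the word boundary still matches there.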
For the second question you're missing a 0 at the end of your last replace string. Just add 0 and it works :)
date_str = "28-06-2010\xc3\x82\xc2\xa008:48 PM"
print(date_str)
date_str = date_str.replace("\n","").replace("\t","").replace("\r","").replace("\xc3\x82\xc2\xa0"," ")
print(repr(date_str))
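Once the stray characters are replaced, the string can be parsed. The sketch below assumes Python 3, where the four escapes are four separate characters in a str. It also swaps %H for %I, since per the Python documentation %p only affects the parsed hour when %I is used; with %H the PM marker is ignored.

```python
from datetime import datetime

date_str = "28-06-2010\xc3\x82\xc2\xa008:48 PM"

# Replace the four stray characters with a single space.
cleaned = date_str.replace("\xc3\x82\xc2\xa0", " ")

# %I (12-hour clock) lets %p take effect, so 08:48 PM becomes 20:48.
date_obj = datetime.strptime(cleaned, "%d-%m-%Y %I:%M %p")
print(date_obj)  # 2010-06-28 20:48:00
```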
