Extract date from strings that contain names+dates - python

I need to extract the dates from a series of strings like this:
'MIHAI MĂD2Ă3.07.1958'
or
'CLAUDIU-MIHAI17.12.1999'
How to do this?
Tried this:
for index, row in DF.iterrows():
    try:
        if math.isnan(row['Data_Nasterii']):
            match = re.search(r'\d{2}.\d{2}.\d{4}', row['Prenume'])
            date = datetime.strptime(match.group(), '%d.%m.%Y').date()
            s = datetime.strftime(datetime.strptime(str(date), '%Y-%m-%d'), '%d-%m-%Y')
            row['Data_Nasterii'] = s
    except TypeError:
        pass

The . (dot) in a regex doesn't mean the literal dot character; it matches any character, and needs to be escaped (\.) to mean an actual dot. Other than that, your first group is \d{2}, but some of your dates have a single-digit day. I would use the following:
re.search(r'(\d+\.\d+\.\d+)', row['Prenume'])
which means at least one number followed by a dot followed by at least one number.....
If you have some mixed characters in your day, you can try the following (sub-par) solution:
''.join(re.search(r'(\d*)(?:[^0-9\.]*)(\d*\.\d+\.\d+)', row['Prenume']).groups())
This will filter out up to one non-digit block inside your "day". It's not pretty, but it works (and returns a string).
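A quick check of both patterns against the first sample string (a sketch, assuming re is imported):

```python
import re

s = 'MIHAI MĂD2Ă3.07.1958'

# Simple pattern: only picks up '3.07.1958' (the leading 2 is cut off by the stray Ă)
print(re.search(r'(\d+\.\d+\.\d+)', s).group(1))  # 3.07.1958

# "Sub-par" pattern: stitches the split day block back together
m = re.search(r'(\d*)(?:[^0-9\.]*)(\d*\.\d+\.\d+)', s)
print(''.join(m.groups()))  # 23.07.1958
```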

You can use the str accessor along with a regex (note that str.extract requires a capturing group in the pattern):
DF['Prenume'].str.extract(r'(\d{1,2}\.\d{2}\.\d{4})')
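A minimal end-to-end sketch of this approach (sample data uses ASCII-cleaned names; str.extract needs a capturing group in the pattern):

```python
import pandas as pd

DF = pd.DataFrame({'Prenume': ['MIHAI MD23.07.1958', 'CLAUDIU-MIHAI17.12.1999']})

# extract() requires a capturing group; expand=False returns a Series
dates = DF['Prenume'].str.extract(r'(\d{1,2}\.\d{2}\.\d{4})', expand=False)
print(dates.tolist())  # ['23.07.1958', '17.12.1999']

# Parse the extracted strings into proper datetimes
DF['Data_Nasterii'] = pd.to_datetime(dates, format='%d.%m.%Y')
```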

You need to escape the dot (.) as \. or put it inside a character class: [.]. The dot is a metacharacter in regex, which matches any character.
e.g. r'[0-9]{2}[.][0-9]{2}[.][0-9]{4}' or r'\d{2}\.\d{2}\.\d{4}'
import re

text = 'CLAUDIU-MIHAI17.12.1999'
pattern = r'\d{2}\.\d{2}\.\d{4}'
if re.search(pattern, text):
    print("yes")

Another good solution could be using dateutil.parser:
import pandas as pd
import dateutil.parser as dparser

df = pd.DataFrame({'A': ['MIHAI MĂD2Ă3.07.1958',
                         'CLAUDIU-MIHAI17.12.1999']})
# encode/decode strips the non-ASCII characters so the fuzzy parser can cope
df['userdate'] = df['A'].apply(
    lambda x: dparser.parse(x.encode('ascii', errors='ignore').decode(), fuzzy=True))
Output:
                         A   userdate
0     MIHAI MĂD2Ă3.07.1958 1958-07-23
1  CLAUDIU-MIHAI17.12.1999 1999-12-17

Related

Extract date from file name with import re in python

My file name looks like as follow:
show_data_paris_12112019.xlsx
I want to extract the date only and I have tried this script:
date = os.path.basename(xls)
pattern = r'(?<=_)_*(?=\.xlsx)'
event_date = re.search(pattern, date).group(0)
event_date_obj = datetime.strptime(event_date, '%Y%m%d')
but it gives me errors. How can I fix this?
Thank you.
It looks to me like the regex you're using is at fault: it finds no match, so re.search returns None and calling group(0) on it fails.
Assuming all your dates are stored as digits, the following regex I've made seems to work quite well:
(?!.+_)\d+(?=\.xlsx)
The next issue is the date format. To me, 12112019 looks like 12/11/2019 (though it could also be 11/12/2019); either way, you need to change the format string passed to strptime.
So for the date / month / year format we would use
# %d%m%Y
event_date_obj = datetime.strptime(event_date, '%d%m%Y')
And we would simply swap %d and %m for the month / date / year format. So your complete code would look something like this:
date = os.path.basename(xls)
pattern = r'(?!.+_)\d+(?=\.xlsx)'
event_date = re.search(pattern, date).group(0)
event_date_obj = datetime.strptime(event_date, '%d%m%Y')
For further information on how to use strftime see https://strftime.org/.
_* matches a sequence of zero or more underscore characters.
(?<=_) means that it has to be preceded by an underscore.
(?=\.xlsx) means that it has to be followed by .xlsx.
So this will match the ___ in foo____.xlsx. But it doesn't match your filename, because the data is between the underscore and .xlsx.
You should match \d+ rather than _* between the lookbehind and lookahead.
pattern = r'(?<=_)\d+(?=\.xlsx)'
And if the data is always 8 digits, use \d{8} to be more specific.
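Putting the pieces together, the whole fix might look like this (using the sample filename from the question):

```python
import os
import re
from datetime import datetime

xls = 'show_data_paris_12112019.xlsx'

date = os.path.basename(xls)
# digits sandwiched between an underscore and the .xlsx extension
event_date = re.search(r'(?<=_)\d+(?=\.xlsx)', date).group(0)
event_date_obj = datetime.strptime(event_date, '%d%m%Y')
print(event_date_obj.date())  # 2019-11-12
```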

Is there a way in python/pandas to remove a particular set of characters from a string

Is there a way to remove a particular set of characters from python string in one go?
str='23.889,45 €'
I want to remove the dot '.' and the '€' sign, but I do not want to call replace() twice, as in str.replace('€', '').replace('.', ''), replacing the characters with an empty string.
In SAS there is a function compress which takes a list of characters to be removed and on applying that function all the characters present in a SAS string will be removed. For eg: compress(str,'.€') will return str as 23889,45.
Is there a corresponding function in Python as well?
Multiple char removal
You may use a regex to perform multiple character replacement.
The construct you are interested in can be a character class or a grouping with alternation.
Character classes are [...] with characters, character ranges or shorthand character classes inside them, and alternation groups are (...|....|.....) like patterns. There may be a problem with using literal chars in both constructs, but re.escape comes to rescue: it will make sure the chars you pass to the regex are treated as literal chars.
See a Python 3 demo:
>>> import re
>>> charsToRemove = ["$", ".", "€"]
>>> s='23.889,45 €'
>>> print(re.sub("|".join([re.escape(x) for x in charsToRemove]), "", s)) # Alternation group
23889,45
>>> print(re.sub(r"[{}]+".format("".join([re.escape(x) for x in charsToRemove])), "", s)) # Character class
23889,45
In Pandas, you'd use (note that Series.str.replace has no inplace argument, so assign the result back):
df['col'] = df['col'].str.replace(r"[{}]+".format("".join([re.escape(x) for x in charsToRemove])), "", regex=True)
Note that the character class approach ([...]+) will work faster.
Multiple replacements
You may consider creating a dictionary of replacements and then use it with Pandas replace:
>>> from pandas import DataFrame
>>> import pandas as pd
>>> import re
>>> repl_list = {'€':'$', ',':'.', r'\.': ''}
>>> col_list = ['23.889,45 €']
>>> frame = pd.DataFrame(col_list, columns=['col'])
>>> frame['col'].replace(repl_list, regex=True, inplace=True)
>>> frame['col']
0    23889.45 $
To make it work, you must pass the regex=True argument, as all the keys in repl_list are regular expressions. Do not forget to escape special regex chars in there (see What special characters must be escaped in regular expressions?). Or, you may write r'\.' as re.escape('.').
The compress function you are talking about must be doing something like this:
def compress(s, charsToRemove):
    for ch in charsToRemove:
        s = s.replace(ch, '')
    return s

print(compress('23.889,45 €', ["$", ".", "€"]))  # returns '23889,45 '
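Incidentally, the closest built-in analogue of SAS compress in Python 3 is str.translate with str.maketrans, which drops a whole set of characters in one pass:

```python
s = '23.889,45 €'
# maketrans('', '', chars) builds a table that deletes every char in the third argument
print(s.translate(str.maketrans('', '', '.€')))  # -> '23889,45 ' (trailing space stays)
```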

Python regex similar expressions

I have a file with two different types of data I'd like to parse with a regex; however, the data is similar enough that I can't find the correct way to distinguish it.
Some lines in my file are of form:
AED=FRI
AFN=FRI:SAT
AMD=SUN:SAT
Other lines are of form
AED=20180823
AMD=20150914
AMD=20150921
The remaining lines are headers and I'd like to discard them. For example
[HEADER: BUSINESS DATE=20160831]
My solution attempt so far is to match first three capital letters and an equal sign,
r'\b[A-Z]{3}=\b'
but after that I'm not sure how to distinguish between dates (eg 20180823) and days (eg FRI:SAT:SUN).
The results I'd expect from these parsing functions:
Regex weekday_rx = new Regex(<EXPRESSION FOR TYPES LIKE AED=FRI>);
Regex date_rx = new Regex(<EXPRESSION FOR TYPES LIKE AED=20160816>);
weekdays = [weekday_rx.Match(line) for line in infile.read()]
dates = [date_rx.Match(line) for line in infile.read()]
r'\S*\d$'
will match all non-whitespace characters ending in a digit, so it will match:
AED=20180823
r'\S*[a-zA-Z]$'
matches all non-whitespace characters ending in a letter, so it will match:
AED=FRI
AFN=FRI:SAT
AMD=SUN:SAT
Neither will match
[HEADER: BUSINESS DATE=20160831]
This will match both:
r'(\S*[a-zA-Z]$|\S*\d$)'
Replacing the * with the number of occurrences you expect is safer; (a|b) means match a or match b.
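As a sketch, here is one way to classify the sample lines using anchored (fullmatch) variants of these ideas:

```python
import re

lines = ['AED=FRI', 'AFN=FRI:SAT', 'AMD=20150914',
         '[HEADER: BUSINESS DATE=20160831]']

# Weekday lines: three capitals, '=', then one or more colon-separated day codes
weekdays = [l for l in lines if re.fullmatch(r'[A-Z]{3}=[A-Z]{3}(:[A-Z]{3})*', l)]
# Date lines: three capitals, '=', then exactly eight digits
dates = [l for l in lines if re.fullmatch(r'[A-Z]{3}=\d{8}', l)]

print(weekdays)  # ['AED=FRI', 'AFN=FRI:SAT']
print(dates)     # ['AMD=20150914']
```

The header line matches neither pattern, so it is discarded for free.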
The following is a solution in Python :)
import re

p = re.compile(r'\b([A-Z]{3})=((\d)+|([A-Z])+)')

str_test_01 = "AMD=SUN:SAT"
m = p.search(str_test_01)
print(m.group(1))
print(m.group(2))

str_test_02 = "AMD=20150921"
m = p.search(str_test_02)
print(m.group(1))
print(m.group(2))
"""
<Output>
AMD
SUN
AMD
20150921
"""
Use pipes to express alternatives in a regex. The pattern '[A-Z]{3}:[A-Z]{3}|[A-Z]{3}' will match both ABC and ABC:ABC (put the longer alternative first, or the shorter one will win). Then use parentheses to group the results:
import re
match = re.match(r'([A-Z]{3}:[A-Z]{3})|([A-Z]{3})', 'ABC:ABC')
assert match.groups() == ('ABC:ABC', None)
match = re.match(r'([A-Z]{3}:[A-Z]{3})|([A-Z]{3})', 'ABC')
assert match.groups() == (None, 'ABC')
You can research the concept of named groups to make this even more readable. Also, take a look at the docs for the match object for useful info and methods.
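A sketch of the named-group variant mentioned above:

```python
import re

pattern = re.compile(r'(?P<pair>[A-Z]{3}:[A-Z]{3})|(?P<single>[A-Z]{3})')

m = pattern.match('ABC:ABC')
print(m.group('pair'))    # ABC:ABC

m = pattern.match('ABC')
print(m.group('single'))  # ABC
```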

Python replace regex

I have a string in which there are some attributes that may be empty:
[attribute1=value1, attribute2=, attribute3=value3, attribute4=]
With python I need to substitute the empty values with the value 'None'. I know I can use string.replace('=,', '=None,').replace('=]', '=None]') on the string, but I'm wondering if there is a way to do it using a regex, maybe with the ?P<name> option.
You can use
import re
s = '[attribute1=value1, attribute2=, attribute3=value3, attribute4=]'
re.sub(r'=(,|])', r'=None\1', s)
\1 is the match in parenthesis.
With python's re module, you can do something like this:
# import it first
import re

# your code
re.sub(r'=([,\]])', r'=None\1', your_string)
Note the raw string in the replacement: a plain '=None\1' would insert the literal character \x01 instead of the backreference.
You can use
s = '[attribute1=value1, attribute2=, attribute3=value3, attribute4=]'
re.sub(r'=(?!\w)', r'=None', s)
This works because the negative lookahead (?!\w) checks if the = character is not followed by a 'word' character. The definition of "word character", in regular expressions, is usually something like "a to z, 0 to 9, plus underscore" (case insensitive).
From your example data it seems all attribute values match this. It will not work if a value may start with something like a comma (unlikely), may be quoted, or may start with anything else. If so, you need a more foolproof setup, such as parsing from the start: skip the attribute name by locating the first = character.
Be specific and use a character class:
import re

string = "[attribute1=value1, attribute2=, attribute3=value3, attribute4=]"
rx = r'\w+=(?=[,\]])'
string = re.sub(rx, r'\g<0>None', string)
print(string)
# [attribute1=value1, attribute2=None, attribute3=value3, attribute4=None]

Python RegEx search and replace with part of original expression

I'm new to Python and looking for a way to replace all occurrences of "[A-Z]0" with the [A-Z] portion of the string to get rid of certain numbers that are padded with a zero. I used this snippet to get rid of the whole occurrence from the field I'm processing:
import re

def strip_zeros(s):
    return re.sub("[A-Z]0", "", s)

test = strip_zeros(!S_fromManhole!)
How do I perform the same type of procedure but without removing the leading letter of the "[A-Z]0" expression?
Thanks in advance!
Use backreferences.
http://www.regular-expressions.info/refadv.html: "\1 through \9: substituted with the text matched between the 1st through 9th pair of capturing parentheses."
http://docs.python.org/2/library/re.html#re.sub: "Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern."
Untested, but it would look like this:
return re.sub(r"([A-Z])0", r"\1", s)
This places the first letter inside a capture group and references it with \1.
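For instance, a quick test of that substitution on a made-up sample:

```python
import re

def strip_zeros(s):
    # Keep the captured letter, drop only the padding zero after it
    return re.sub(r"([A-Z])0", r"\1", s)

print(strip_zeros("A01B02C3"))  # A1B2C3
```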
You can try something like the following (note: the two-argument form of str.translate shown here is Python 2 only; in Python 3 use s.translate(str.maketrans('', '', '0'))):
In [47]: s = "ab0"
In [48]: s.translate(None, '0')
Out[48]: 'ab'
In [49]: s = "ab0zy"
In [50]: s.translate(None, '0')
Out[50]: 'abzy'
I like Patashu's answer for this case but for the sake of completeness, passing a function to re.sub instead of a replacement string may be cleaner in more complicated cases. The function should take a single match object and return a string.
>>> def strip_zeros(s):
...     def unpadded(m):
...         return m.group(1)
...     return re.sub("([A-Z])0", unpadded, s)
...
>>> strip_zeros("Q0")
'Q'
