Issue on matching a string and replacing hex characters in Python

st = """
What kind of speCialist would we see for this?He also seems to have reactions to the red dye cochineal/carmine cialist,I like Cialist much
"""
Here I need to replace only the exact word Cialist (case-insensitively); it may also have a comma at the end.
The word "spe*cialist*" should not be matched.
I tried this regex:
bold_string = "<b>"+"Cialist"+"</b>"
insensitive_string = re.compile(re.escape("cialist"), re.IGNORECASE)
comment = insensitive_string.sub(bold_string,st)
but it also replaces the match inside "speCialist".
Could you suggest how to fix this?
I have one more issue, with replacing hexadecimal characters in Python.
date_str = "28-06-2010\xc3\x82\xc2\xa008:48 PM"
date_str = date_str.replace("\n","").replace("\t","").replace("\r","").replace("\xc3\x82\xc2\xa"," ")
date_obj = datetime.strptime(date_str,"%d-%m-%Y %H:%M %p")
Error: time data '08-09-2005\xc3\x82\xc2\xa010:18 PM' does not match format '%d-%m-%Y %H:%M %p'
Here I am not able to replace the hex characters with a space so that the string matches the datetime pattern.
Could you please help me with this issue?

For your second Q:
>>> re.sub(r'\\[a-zA-Z0-9]{2}', lambda L: str(int(L.group()[2:], 16)), text)
'28-06-20101238212210008:48 PM'
Then either reorganise that to suit your strptime format, or have strptime interpret it.

Two questions in one?
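Add word boundaries to your regex, so it becomes re.sub(r'\bcialist\b', bold_string, your_string, flags=re.I). Pass the flag by keyword: the fourth positional argument of re.sub is count, not flags.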

Use \b to match a word boundary. Then it becomes simple :)
import re
st = """
What kind of speCialist would we see for this?He also seems to have reactions to the red dye cochineal/carmine cialist,I like Cialist much
"""
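# re.IGNORECASE also catches the lowercase "cialist," while \b leaves "speCialist" alone
print(re.sub(r'\bcialist\b', "<b>Cialist</b>", st, flags=re.IGNORECASE))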
For the second question you're missing a 0 at the end of your last replace string. Just add 0 and it works :)
date_str = "28-06-2010\xc3\x82\xc2\xa008:48 PM"
print(date_str)
date_str = date_str.replace("\n","").replace("\t","").replace("\r","").replace("\xc3\x82\xc2\xa0"," ")
print(repr(date_str))
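Putting both fixes together, here is a minimal Python 3 sketch. It assumes you want to keep the original capitalisation of each match (hence the \g<0> backreference, which is my choice and not part of the question), and that the time is on a 12-hour clock. With strptime, %p only affects the parsed hour when the hour is read with %I, so I swapped %H for %I here.

```python
import re
from datetime import datetime

st = "What kind of speCialist would we see for this? carmine cialist,I like Cialist much"

# \b keeps "speCialist" untouched; \g<0> re-inserts the matched text with its original case.
bolded = re.sub(r"\bcialist\b", r"<b>\g<0></b>", st, flags=re.IGNORECASE)
print(bolded)

# The full four-character sequence (ending in \xa0) is replaced by a single space.
date_str = "28-06-2010\xc3\x82\xc2\xa008:48 PM"
cleaned = date_str.replace("\xc3\x82\xc2\xa0", " ")

# %I (12-hour clock) is needed for %p to be taken into account.
date_obj = datetime.strptime(cleaned, "%d-%m-%Y %I:%M %p")
print(date_obj)
```

If your time really is on a 24-hour clock, keep %H and drop the AM/PM part instead.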

Related

Extract date from file name with import re in python

My file name looks as follows:
show_data_paris_12112019.xlsx
I want to extract the date only and I have tried this script:
date = os.path.basename(xls)
pattern = r'(?<=_)_*(?=\.xlsx)'
re.search(pattern, date).group(0)
event_date = re.search(pattern, date).group(0)
event_date_obj = datetime.strptime (event_date, '%Y%m%d')
but it gives me errors. How can I fix this?
Thank you.
It looks to me like the regex you're using is also at fault: it finds no match, so re.search returns None and the call to group(0) fails.
Assuming all your dates are stored as digits, the following regex seems to work quite well:
(?!.+_)\d+(?=\.xlsx)
The next issue is the format string passed to strptime. To me 12112019 looks like 12/11/2019 (day, month, year), although it could also be month, day, year. Either way '%Y%m%d' does not fit, so we change the format strptime uses to parse the date.
So for the day / month / year format we would use
# %d%m%Y
event_date_obj = datetime.strptime(event_date, '%d%m%Y')
And we would simply swap %d and %m for the month / day / year format. So your complete code would look something like this:
date = os.path.basename(xls)
pattern = r"(?!.+_)\d+(?=\.xlsx)"
event_date = re.search(pattern, date).group(0)
event_date_obj = datetime.strptime (event_date, '%d%m%Y')
For further information on the format codes shared by strftime and strptime, see https://strftime.org/.
_* matches a sequence of zero or more underscore characters.
(?<=_) means that it has to be preceded by an underscore.
(?=\.xlsx) means that it has to be followed by .xlsx.
So this will match the ___ in foo____.xlsx. But it doesn't match your filename, because the data is between the underscore and .xlsx.
You should match \d+ rather than _* between the lookbehind and lookahead.
pattern = r'(?<=_)\d+(?=\.xlsx)'
And if the data is always 8 digits, use \d{8} to be more specific.
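For completeness, a small sketch of the whole thing with the \d{8} pattern. The helper name extract_event_date and the example path are made up for illustration, and I am assuming the digits are in day, month, year order; swap %d and %m if yours are month-first.

```python
import os
import re
from datetime import datetime

def extract_event_date(path):
    """Hypothetical helper: pull an 8-digit date out of names like foo_12112019.xlsx."""
    name = os.path.basename(path)
    # Eight digits, preceded by an underscore and followed by ".xlsx" at the end of the name.
    match = re.search(r"(?<=_)\d{8}(?=\.xlsx$)", name)
    if match is None:
        return None  # no date found, so the caller can decide what to do
    # Assumption: the digits are day, month, year.
    return datetime.strptime(match.group(0), "%d%m%Y")

event_date = extract_event_date("reports/show_data_paris_12112019.xlsx")
print(event_date)
```

Checking for None before calling group(0) avoids the AttributeError you get when the pattern does not match.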

How to match arbitrary characters in a str when formatting datetime in Python?

I want to convert a string to a datetime in Python, but the string contains some irrelevant characters.
The official Python datetime documentation shows no clear way to match arbitrary characters in the string.
For example, I have a String 2018-01-01Ajrwoe.
I want to convert this string into a datetime as 2018 (year) 01 (month) and 01 (day).
The rest of this string is irrelevant.
I know that I can change the string (remove the irrelevant characters) first like
raw_str = "2018-01-01Ajrwoe"
my_str = raw_str[:10]
strptime(my_str, my_format)
But I just want to directly match the arbitrary characters like
raw_str = "2018-01-01Ajrwoe"
strptime(raw_str, my_format)
But what should my_format be?
Thanks in advance.
Here is a way to do it using a regex to clean your string:
import re
import datetime
raw_str = "2018-01-01Ajrwoe"
datetime.datetime.strptime(re.sub('[a-zA-Z]', '', raw_str), '%Y-%m-%d')
Output :
datetime.datetime(2018, 1, 1, 0, 0)
You could match the date using a regex.
For example:
import re
raw_str = "2018-01-01Ajrwoe"
match = re.search(r'\d+-\d+-\d+', raw_str)
print(match.group(0))
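As far as I know strptime has no wildcard directive, and it raises a ValueError when unconverted text remains, so extracting the date first is the usual workaround. A minimal sketch combining the regex match with strptime, assuming the date always appears as YYYY-MM-DD somewhere in the string:

```python
import re
from datetime import datetime

raw_str = "2018-01-01Ajrwoe"

# Look for a YYYY-MM-DD shaped substring and ignore everything around it.
match = re.search(r"\d{4}-\d{2}-\d{2}", raw_str)
parsed = datetime.strptime(match.group(0), "%Y-%m-%d") if match else None
print(parsed)
```

The None fallback is just one way to handle strings without a date; raising an error may suit your case better.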

Using regex in Python to insert a space between a date and time

I have a list of dates and times in the following format:
25/07/201711:00:00
I just want to insert whitespace between the date and time so it looks like:
25/07/2017 11:00:00
The string replace method works well but is not very robust i.e. mystring.replace("2017","2017 " ) works but only for 2017 dates. Regex sub method seems to be what I need to use but have not been successful so far. Any suggestions would be very helpful as my regex knowledge is limited.
This is the closest I have come:
>>> re.sub(r'20[0-9][0-9]', r'20[0-9][0-9] ', s)
'20[0-9][0-9] 04:00'
If your dates are in 'dd/mm/yyyy' format, why can't you just index your string?
>>> mystring[:10] + ' ' + mystring[10:]
'25/07/2017 11:00:00'
One way would be to use lookarounds:
(?<=\d{4})(?=\d{2}:)
This needs to be replaced by a whitespace, see a demo on regex101.com.
In Python this would be
import re
date = "25/07/201711:00:00"
date = re.sub(r'(?<=\d{4})(?=\d{2}:)', ' ', date)
print(date)
# 25/07/2017 11:00:00
As noted in the comments section, if the date is always in the same format, it may be simpler to slice the string.
You don't need a regular expression for this, assuming that the dates are padded, so that they're all the same length (which seems to be the case).
>>> date = '25/07/201711:00:00'
>>> n = len('dd/mm/yyyy') # Splitting index (easier to understand than a magic constant)
>>> print(date[:n] + ' ' + date[n:])
25/07/2017 11:00:00
The best way is to match the full date and time in two groups, for instance like this:
import re
regex = r"(\d{2}/\d{2}/\d{4})(\d{2}:\d{2}:\d{2})"
re.sub(regex, r"\1 \2", "25/07/201711:00:00")
# -> '25/07/2017 11:00:00'
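Since the question mentions a list of dates, here is a rough sketch applying the two-group pattern to several strings and then parsing the results. The second sample value is invented for illustration, and the strptime step is an extra I added to check that the spaced strings are valid dates.

```python
import re
from datetime import datetime

# Sample data; the second entry is made up for this example.
stamps = ["25/07/201711:00:00", "01/12/201804:00:00"]

# Group 1 is the date, group 2 is the time; the replacement puts a space between them.
pattern = re.compile(r"(\d{2}/\d{2}/\d{4})(\d{2}:\d{2}:\d{2})")
spaced = [pattern.sub(r"\1 \2", s) for s in stamps]
print(spaced)

# Optional: parse the spaced strings to confirm they are real dates and times.
parsed = [datetime.strptime(s, "%d/%m/%Y %H:%M:%S") for s in spaced]
```

Compiling the pattern once is only a small convenience when the same regex is reused in a loop.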

Python stripping characters(words) out of a string

I have a string
string = "Friday07:48 AM"
How do I get rid of "Friday"? I could simply use a replace() function but this string could also be any other day of the week. So it could look like:
string = "Sunday07:48 AM"
How do I only get "07:48 AM"?
We can utilize the fact that every day of the week in English ends in the substring 'day' to locate that within your string, and then go from three indices farther from where 'day' starts until the end of your string.
date_str = "Friday07:48 AM"
new_str = date_str[date_str.index('day')+3:]
new_str # '07:48 AM'
As an aside, avoid naming a variable 'str' or 'string': str is a built-in type and string is a standard library module, so reusing those names shadows them.
For the six-letter days (Friday, Sunday and Monday) you could grab the day name with:
day = string[0:6]
but in all cases, if the time stamp at the end always has the same length (8 characters here), you could use:
day = string[:-8]
and the time itself is then string[-8:]. Try it in the shell.
How about using regular expressions?
import re
re.search("([a-z]+)(.*)",string,flags=re.I).group(2)
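If you would rather be explicit about what gets removed, a stricter variant is to strip only a leading English weekday name. This is just a sketch; strip_weekday is a name I made up, and it assumes the day is always spelled out in full at the start of the string.

```python
import re

def strip_weekday(text):
    """Hypothetical helper: remove a leading English weekday name, if present."""
    # Every weekday is one of these prefixes followed by "day".
    return re.sub(r"^(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day", "", text)

print(strip_weekday("Friday07:48 AM"))
print(strip_weekday("Wednesday11:05 PM"))
```

Strings that do not start with a weekday are returned unchanged, which makes the helper safe to apply to mixed input.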

Number Trouble with Regex in Python

I'm trying to filter a date retrieved from a .csv file, but no combination I try seems to work. The date comes in as "2011-10-01 19:25:01" or "year-month-date hour:min:sec".
I want just the year, month and date, but I can't seem to get rid of the time in the string:
date = bug[2] # Column in which the date is located
date = date.replace('\"','') #getting rid of the quotations
mdate = date.replace(':','')
re.split('$[\d]+',mdate) # trying to get rid of the trailing set of number (from the time)
Thanks in advance for the advice.
If your source is a string, you'd be better off using strptime:
import datetime
string = "2011-10-01 19:25:01"
dt = datetime.datetime.strptime(string, "%Y-%m-%d %H:%M:%S")
After that, use
dt.year
dt.month
dt.day
to access the data you want.
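A short sketch of that approach applied to the quoted value from the question. I am assuming the CSV field still carries its surrounding quotation marks, which is why they are stripped first.

```python
from datetime import datetime

raw = '"2011-10-01 19:25:01"'  # field as read from the CSV, quotes included

# Strip the surrounding quotation marks, then parse the full timestamp.
dt = datetime.strptime(raw.strip('"'), "%Y-%m-%d %H:%M:%S")

date_only = dt.date()                 # a datetime.date without the time part
date_text = dt.strftime("%Y-%m-%d")   # the same date as a string
print(date_only, date_text)
```

Parsing the whole timestamp also validates it, which plain slicing does not.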
Use datetime to parse your input as a datetime object, then output it in whatever format you like: http://docs.python.org/library/datetime.html
I think you're confusing the circumflex for start of line and dollar for end of line. Try ^[\d-]+.
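In case it helps, this is roughly how that pattern would be used. It is a sketch that assumes the quotation marks have already been removed from the string.

```python
import re

mdate = "2011-10-01 19:25:01"

# ^ anchors at the start; [\d-]+ consumes digits and hyphens and stops at the space.
m = re.match(r"^[\d-]+", mdate)
date_part = m.group(0) if m else None
print(date_part)
```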
If the format is always "YYYY-MM-DD HH:mm:ss" and the string still has its surrounding quotation marks, then try this:
date = date[1:11]
In a prompt:
>>> date = '"2012-01-12 15:13:20"'
>>> date[1:11]
'2012-01-12'
>>>
No need for regex
>>> date = '"2011-10-01 19:25:01"'
>>> date.strip('"').split()[0]
'2011-10-01'
One problem with your code is that in your last regular expression, $ matches the end of the string, so that regular expression will never match anything. You could do this much more simply by splitting by spaces and only taking the first result. After removing the quotation marks, the line
date.split()
will return ["2011-10-01", "19:25:01"], so the first element of that list is what you need.
