Extract datetime from a folder path string - python

I have a list of files that are arranged in the following format:
'folder/sensor_01/2021/12/31/005_6_0.csv.gz',
'folder/sensor_01/2022/01/01/005_0_0.csv.gz',
'folder/sensor_01/2022/01/02/005_1_0.csv.gz',
'folder/sensor_01/2022/01/03/005_4_0.csv.gz',
....
Now, what I want to do is keep only the entries that fall within a given time range. In each path, the segments after sensor_01 and before the file name (005_...) give the date (day resolution).
I am stuck on how to extract this date segment from the folder path and convert it to a Python datetime object. I think I can then use the comparison operators to filter the entries.

This comes down to parsing a string into a datetime object. There are two ways to do it.
Split
You can split the text to get the Year, Month, and Day part.
file = 'folder/sensor_01/2021/12/31/005_6_0.csv.gz'
file.split("/")
# ['folder', 'sensor_01', '2021', '12', '31', '005_6_0.csv.gz']
Here the elements at indices 2, 3 and 4 (the 3rd, 4th and 5th items) are the year, month and day.
Or
strptime
See https://stackoverflow.com/a/466376/2681662. You can create a datetime object from a string with strptime. The format string may contain arbitrary literal text, so the year, month and day do not have to be separated by any particular delimiter.
So:
from datetime import datetime
file = 'folder/sensor_01/2021/12/31/005_6_0.csv.gz'
# Valid, but the literal parts of the format must match this exact path and file name
datetime.strptime(file, 'folder/sensor_01/%Y/%m/%d/005_6_0.csv.gz')
# datetime.datetime(2021, 12, 31, 0, 0)
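To get from a parsed date to the filtering the question asks for, here is a minimal sketch built on the split approach. The helper name `path_date` and the `start`/`end` values are made up for illustration, and it assumes every path has exactly the layout shown in the question.

```python
from datetime import datetime

def path_date(path):
    """Return the date encoded in a 'folder/sensor/YYYY/MM/DD/file' path."""
    parts = path.split("/")
    # parts[2], parts[3] and parts[4] hold year, month and day.
    return datetime(int(parts[2]), int(parts[3]), int(parts[4]))

files = [
    'folder/sensor_01/2021/12/31/005_6_0.csv.gz',
    'folder/sensor_01/2022/01/01/005_0_0.csv.gz',
    'folder/sensor_01/2022/01/02/005_1_0.csv.gz',
    'folder/sensor_01/2022/01/03/005_4_0.csv.gz',
]

# Hypothetical range; both ends are inclusive.
start = datetime(2022, 1, 1)
end = datetime(2022, 1, 2)

# datetime objects support chained comparison, so the filter is one line.
selected = [f for f in files if start <= path_date(f) <= end]
print(selected)
```

If the paths can have a different number of leading folders, indexing from the end (`parts[-4:-1]`) may be more robust than fixed positions.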

This is easily done using regex.
\S+\/sensor_[\d]+\/([\S\/]+)\/[\S_]+\.csv\.gz
I have used this regex to match and group the date portion of one of the strings.
In [11]: import re
In [12]: string = 'folder/sensor_01/2021/12/31/005_6_0.csv.gz'
In [13]: reg = r'\S+\/sensor_[\d]+\/([\S\/]+)\/[\S_]+\.csv\.gz'
In [15]: re.match(reg, string).groups()[0]
Out[15]: '2021/12/31'
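The captured group is still a string, so it needs one more step to become a datetime. A possible sketch, using a narrower pattern than the one above (it only looks for a YYYY/MM/DD segment between slashes); the names `DATE_IN_PATH` and `extract_date` are illustrative, not from the original answer.

```python
import re
from datetime import datetime

# Four digits, two digits, two digits, each separated by a slash.
DATE_IN_PATH = re.compile(r'/(\d{4}/\d{2}/\d{2})/')

def extract_date(path):
    """Return the date found in the path, or None if there is none."""
    match = DATE_IN_PATH.search(path)
    if match is None:
        return None  # no date-like segment in this path
    return datetime.strptime(match.group(1), '%Y/%m/%d')

print(extract_date('folder/sensor_01/2021/12/31/005_6_0.csv.gz'))
```

Returning None for non-matching paths is a design choice for this sketch; raising an error instead may suit stricter pipelines.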

Related

convert datetime column to a zero-based months. 0 is Jan and 11 is Dec

I have a datetime column that I want to 'stringify' using strftime. The problem is that I want the months to be zero-based, i.e. 0=January, 11=December.
What I've tried: after 'stringifying' the column, I call str.replace on it with a regex and a callable that converts the month to a number, subtracts one, and converts it back to a string.
Why do I want it to be zero-based?
Because this data is going to be consumed by Google Charts, which requires dates represented as strings to have zero-based months.
Here is the code. Is there a better solution?
month_regex = r",(0[1-9]|1[0-2])"
# vvv -> month_regex
format = "Date(%Y,%m,%d,%H,%M,%S)"
print(df["start"].dtype) # float64 represents an epoch
# convert epoch to datetime and then to string with the given format
df["start"] = pd.to_datetime(df["start"]//1000, unit="s").dt.strftime(format)
print(df["start"]) # Date(2022,05,24,00,00,00)
df["start"] = df["start"].str.replace(
month_regex,
lambda match: "," + str(int(match[0][1:]) - 1),
1, # first occurrence only
regex=True)
print(df["start"]) # Date(2022,4,24,00,00,00)
Simply use string formatting to achieve the same result.
df = pd.to_datetime(pd.Series(["2022-01-01"]))
# "%%i" survives strftime as a literal "%i", which the % operator then fills with month - 1
df.apply(lambda x: x.strftime("Date(%Y,%%i,%d,%H,%M,%S)") % (x.month-1))
I would make your regex only capable of matching the month part of the Date(...) string by using a lookbehind for Date followed by a (, 4 digits and a comma:
(?<=Date\(\d{4},)\d\d
Then you need only worry about replacing the match:
df['start'].str.replace(r'(?<=Date\(\d{4},)\d\d', lambda m: f'{int(m[0])-1:02d}', regex=True)
Note I've used an f-string to ensure the output month value has 2 digits (i.e. 04 instead of 4). If that isn't necessary, just use:
df['start'].str.replace(r'(?<=Date\(\d{4},)\d\d', lambda m: str(int(m[0])-1), regex=True)
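To check the lookbehind pattern without a DataFrame, the same substitution can be tried on plain strings with the standard re module. This is only a sketch; the function name `zero_based_month` is made up here.

```python
import re

# Match the two month digits only when preceded by "Date(" + 4-digit year + ",".
MONTH = re.compile(r'(?<=Date\(\d{4},)\d\d')

def zero_based_month(text):
    """Subtract one from the month field of a Date(...) string, keeping 2 digits."""
    return MONTH.sub(lambda m: f'{int(m[0]) - 1:02d}', text)

print(zero_based_month('Date(2022,05,24,00,00,00)'))  # Date(2022,04,24,00,00,00)
```

Because the lookbehind pins the match to the position right after the year, the day, hour, minute and second fields are never touched, so no "first occurrence only" argument is needed.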

replace the date section of a string in python

Suppose I have the string 'Tpsawd_20220320_default_economic_v5_0.xls'.
I want to replace the date part (20220320) with a date variable (i.e. if I define date = 20220410, it will replace 20220320 with this date). How should I do it with a built-in Python package? Please note that the date location in the string can vary. It might be 'Tpsawd_default_economic_v5_0_20220320.xls' or 'Tpsawd_default_economic_20220320_v5_0.xls'.
Yes, this can be done with regex fairly easily~
import re
s = 'Tpsawd_20220320_default_economic_v5_0.xls'
date = '20220410'
s = re.sub(r'\d{8}', date, s)
print(s)
Output:
Tpsawd_20220410_default_economic_v5_0.xls
This replaces every run of 8 consecutive digits with the given string, in this case date. To replace only the first such run, pass count=1 to re.sub.
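If the file names might contain longer digit runs, or more than one 8-digit run, a slightly stricter variant could look like the sketch below. The function name `replace_date` is hypothetical, and it assumes the date is the first standalone block of exactly 8 digits.

```python
import re

def replace_date(name, new_date):
    """Replace the first standalone 8-digit block in name with new_date."""
    # (?<!\d) and (?!\d) make sure the 8 digits are not part of a longer number;
    # count=1 limits the substitution to the first match.
    return re.sub(r'(?<!\d)\d{8}(?!\d)', new_date, name, count=1)

for name in [
    'Tpsawd_20220320_default_economic_v5_0.xls',
    'Tpsawd_default_economic_v5_0_20220320.xls',
    'Tpsawd_default_economic_20220320_v5_0.xls',
]:
    print(replace_date(name, '20220410'))
```

The replacement is passed as a string of digits here; if it could ever contain backslashes, a function replacement would be safer.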

parse odd dataframe index to datetime

I have a dataframe that I've pulled from the EIA API, however, all of the index values are of the format 'YYYY mmddTHHZ dd'. For example, 11am on today's date appears as '2020 0317T11Z 17'.
What I would like to be able to do is parse this index such that there is a separate ['Date'] and ['Time']column with the date in YYYY-mm-dd format and the hour as a singular number, i.e. 11.
It is not a datetime object and I'm not sure how to parse an index and replace in this manner. Any help is appreciated.
Thanks.
Strip the redundant trailing part ('Z dd'), then parse what is left:
import pandas as pd
s = pd.Series(['2020 0317T11Z 17'])
datetimes = pd.to_datetime(s.str[:-4], format='%Y %m%dT%H')
# after converting to datetime, you can extract
dates = datetimes.dt.normalize()
times = datetimes.dt.time
# or better
# times = datetimes - dates
# the hour as a single integer: datetimes.dt.hour
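The format string can be checked on a single index label with the standard library before applying it to the whole index. A small sketch, assuming every label ends in the same 4-character 'Z dd' suffix; `split_label` is a made-up name.

```python
from datetime import datetime

def split_label(label):
    """Split a 'YYYY mmddTHHZ dd' label into ('YYYY-mm-dd', hour)."""
    # Drop the trailing 'Z dd' (4 characters), then parse the remainder.
    parsed = datetime.strptime(label[:-4], '%Y %m%dT%H')
    return parsed.strftime('%Y-%m-%d'), parsed.hour

print(split_label('2020 0317T11Z 17'))  # ('2020-03-17', 11)
```

This gives the date as a string in YYYY-mm-dd form and the hour as a plain integer, which is the shape the question asks for.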

Pandas sets datetime to first day of month if missing day?

When I used Pandas to convert my datetime string, it sets it to the first day of the month if the day is missing.
For example:
pd.to_datetime('2017-06')
OUT[]: Timestamp('2017-06-01 00:00:00')
Is there a way to have it use the 15th (middle) day of the month?
EDIT:
I only want it to use day 15 if the day is missing, otherwise use the actual date - so offsetting all values by 15 won't work.
While this isn't possible with the to_datetime call itself, you can use regex matching to check whether the string contains a day and proceed accordingly. Note: this code only works with '-' delimited dates:
import re
import pandas as pd
date_str = '2017-06'
if not re.match('.+-.+-.+', date_str):
    # no day component: default to the 15th
    result = pd.to_datetime(date_str).replace(day=15)
else:
    result = pd.to_datetime(date_str)
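The same idea can be written without a regex by trying the full format first and falling back to the year-month format. This is a standard-library sketch rather than a pandas feature; the function name and the `default_day` parameter are illustrative, and it assumes only the 'YYYY-MM-DD' and 'YYYY-MM' layouts occur.

```python
from datetime import datetime

def parse_with_default_day(date_str, default_day=15):
    """Parse 'YYYY-MM-DD', or 'YYYY-MM' with the day set to default_day."""
    try:
        # A full date keeps its actual day.
        return datetime.strptime(date_str, '%Y-%m-%d')
    except ValueError:
        # The day is missing: parse year and month, then set the default day.
        return datetime.strptime(date_str, '%Y-%m').replace(day=default_day)

print(parse_with_default_day('2017-06'))     # 2017-06-15 00:00:00
print(parse_with_default_day('2017-06-03'))  # 2017-06-03 00:00:00
```

A string that matches neither layout still raises ValueError from the second strptime call, which is probably the desired behaviour for bad input.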

Changing list answers in python

I've been trying to insert into a MySQL table using Python. I'm trying to create a list of all dates from April 2016 to now so I can insert them individually in the SQL insert. I searched but couldn't find how to change each value in the list depending on whether it has 1 digit or 2 digits:
dates = ['2016-04-'+str(i+1) for i in range(9,30)]
I would like a 0 to be added every time i is a single digit (i.e. 1, 2, 3 etc.),
and for it to stay as it is when it has two digits (i.e. 10, 11, 12 etc.).
dates = ['2016-04-'+ '{:02d}'.format(i) for i in range(9,30)]
>>> print(dates)
['2016-04-09', '2016-04-10', '2016-04-11', '2016-04-12', '2016-04-13', '2016-04-14', '2016-04-15',
 '2016-04-16', '2016-04-17', '2016-04-18', '2016-04-19', '2016-04-20', '2016-04-21', '2016-04-22',
 '2016-04-23', '2016-04-24', '2016-04-25', '2016-04-26', '2016-04-27', '2016-04-28', '2016-04-29']
Using C style formatting, all the dates in April:
dates = ['2016-04-%02d'%i for i in range(1,31)]
Need range(1,31) since the last value in the range is not used, or use range(30) and add 1 to i.
The same using .format():
dates = ['2016-04-{:02}'.format(i) for i in range(1,31)]
You can use the dateutil module:
from datetime import datetime
from dateutil.rrule import rrule, DAILY
# integer literals must not have leading zeros (04 is a syntax error in Python 3)
start_date = datetime(2016, 4, 1)
w = [each.strftime('%Y-%m-%d') for each in rrule(freq=DAILY, dtstart=start_date, until=datetime(2016, 5, 9))]
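If installing dateutil is not an option, the standard library alone can produce the same list across month boundaries. A minimal sketch; `daily_dates` is a made-up helper name, and the end date is fixed here so the result is reproducible (use `date.today()` for "up to now").

```python
from datetime import date, timedelta

def daily_dates(start, end):
    """Return 'YYYY-MM-DD' strings for every day from start to end, inclusive."""
    days = (end - start).days
    return [(start + timedelta(days=n)).strftime('%Y-%m-%d') for n in range(days + 1)]

dates = daily_dates(date(2016, 4, 1), date(2016, 5, 9))
print(dates[0], dates[-1], len(dates))  # 2016-04-01 2016-05-09 39
```

strftime zero-pads the month and day, so the single-digit problem from the question does not come up.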
