How to deal with inconsistent date series in Python? - python

Inconsistent date formats
As shown in the photo above, the check-in and check-out dates are in inconsistent formats. Whenever I try to convert the entire series to datetime using
df['Check-in date'] = pd.to_datetime(df['Check-in date'], errors='coerce')
and
df['Check-out date'] = pd.to_datetime(df['Check-out date'], errors='coerce')
the days and months get mixed up. I don't really know what to do now. I also tried splitting the days, months and years and rearranging them, but still had no luck.
My goal is to get the total number of nights each guest stayed, but because of the inconsistency I end up with negative totals.
I'd appreciate any help here. Thanks!

You can try different formats with strptime and return a DateTime object if any of them works.
from datetime import datetime
import pandas as pd
def try_different_formats(value):
    only_date_format = "%d/%m/%Y"
    date_and_time_format = "%Y-%m-%d %H:%M:%S"
    try:
        return datetime.strptime(value, only_date_format)
    except ValueError:
        pass
    try:
        return datetime.strptime(value, date_and_time_format)
    except ValueError:
        return pd.NaT
In your example:
df = pd.DataFrame({'Check-in date': ['19/02/2022','2022-02-12 00:00:00']})
         Check-in date
0           19/02/2022
1  2022-02-12 00:00:00
The apply method runs this function on every value of the Check-in date column. The result is a column of datetime values.
df['Check-in date'].apply(try_different_formats)
0   2022-02-19
1   2022-02-12
Name: Check-in date, dtype: datetime64[ns]
For a more pandas-specific solution, you can check out this answer.
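One possible pandas-only variant of the same idea is sketched below: parse the column once per expected format with errors='coerce', then fill the gaps of the first result with the second. The helper name parse_mixed and the check-out values are made up for illustration, and the sketch assumes only the two formats shown above occur in the data.

```python
import pandas as pd

def parse_mixed(series):
    """Parse a column that mixes 'dd/mm/YYYY' and 'YYYY-mm-dd HH:MM:SS' strings."""
    # Each call turns values that do not match its format into NaT.
    day_first = pd.to_datetime(series, format="%d/%m/%Y", errors="coerce")
    iso_like = pd.to_datetime(series, format="%Y-%m-%d %H:%M:%S", errors="coerce")
    # Keep the day-first result where it parsed, otherwise use the ISO-like one.
    return day_first.fillna(iso_like)

# Hypothetical sample rows; the check-out values are invented for this sketch.
df = pd.DataFrame({
    "Check-in date": ["19/02/2022", "2022-02-12 00:00:00"],
    "Check-out date": ["2022-02-21 00:00:00", "15/02/2022"],
})

check_in = parse_mixed(df["Check-in date"])
check_out = parse_mixed(df["Check-out date"])

# Number of nights as whole days between check-out and check-in.
df["nights"] = (check_out - check_in).dt.days
print(df)
```

Values that match neither format stay NaT, so they can be inspected afterwards instead of being silently misread.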

Related

How to handle dates which is out of timestamp range in pandas?

I was working with the Crunchbase dataset. I have an entry for Harvard University, which was founded in 1636. This entry gives me an error when I try to convert the string to DateTime.
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1636-09-08 00:00:00
I found out that pandas only supports timestamps from 1677 onwards
>>> pd.Timestamp.min
Timestamp('1677-09-21 00:12:43.145225')
I checked out some solutions, such as one suggesting errors='coerce', but dropping this entry or making it null is not an option.
Can you please suggest a way to handle this issue?
As mentioned in the comments by Henry, pandas timestamps have a limited range because they are stored as a 64-bit integer count of nanoseconds, which only covers roughly the years 1677 to 2262. You can work around this by parsing the date-time with the datetime library when needed, and otherwise leaving it as a string or converting it to an integer.
Scenario 1: If you only need the value as a date when you display it
from datetime import datetime
datetime_object = datetime.strptime('1636-09-08 00:00:00', '%Y-%m-%d %H:%M:%S')
Scenario 2: If you want to keep it as a date column in the dataframe, you could additionally format it as a sortable string of digits
datetime_object.strftime("%Y%m%d%H%M%S")
Applying the parsing step to a column in a pandas dataframe would yield this
df=pd.DataFrame([['1636-09-08 00:00:00'],['1635-09-09 00:00:00']], columns=['dates'])
df['str_date']=df['dates'].apply(lambda x:datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
df.head()
                 dates             str_date
0  1636-09-08 00:00:00  1636-09-08 00:00:00
1  1635-09-09 00:00:00  1635-09-09 00:00:00
pandas treats this column as an object column, but each element you access is a datetime.datetime object
df['str_date'][0]
>>datetime.datetime(1636, 9, 8, 0, 0)
also, adding this for the sake of completeness: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-oob
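The documentation section linked above describes representing out-of-bounds spans with periods. A minimal sketch of that idea follows; the helper name to_day_period is hypothetical, and day-level resolution is an assumption (the time of day is dropped).

```python
from datetime import datetime
import pandas as pd

raw = ["1636-09-08 00:00:00", "1635-09-09 00:00:00"]

def to_day_period(text):
    # Parse with the standard library, which has no 1677 lower bound.
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    # Build a day-frequency Period from the parsed fields.
    return pd.Period(year=parsed.year, month=parsed.month, day=parsed.day, freq="D")

periods = [to_day_period(text) for text in raw]

# Date components remain available on each Period.
print(periods[0].year, periods[0].month, periods[0].day)

# For day frequency the ordinals count days, so their difference is the gap in days.
print(periods[0].ordinal - periods[1].ordinal)
```

Unlike plain strings, these values still support date arithmetic and component access, which may matter if the founding dates are used in calculations.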

Create date from one year with string and int error - PYTHON

I have the following problem. I want to create a date from another one. To do this, I extract the year from the date in the database and then build the chosen date (day = 30, month = 9) using that year.
The code is the following:
bbdd20Q3['year']=(pd.DatetimeIndex(bbdd20Q3['datedaymonthyear']).year)
y=(bbdd20Q3['year'])
m=int(9)
d=int(30)
bbdd20Q3['mydate']=dt.datetime(y,m,d)
But the error message is this:
"cannot convert the series to <class 'int'>"
I think dt means datetime, so the line 'dt.datetime(y,m,d)' creates a datetime object.
Should bbdd20Q3['mydate'] hold an int?
If so, try to think of another way to store the date (an 8-digit number, maybe).
Hope this helps :)
I assume that you did import datetime as dt. Then by doing:
bbdd20Q3['year']=(pd.DatetimeIndex(bbdd20Q3['datedaymonthyear']).year)
y=(bbdd20Q3['year'])
m=int(9)
d=int(30)
bbdd20Q3['mydate']=dt.datetime(y,m,d)
you are passing a Series as the first argument to datetime.datetime, which expects an int or something that can be converted to an int. You should create one datetime.datetime for each element of the Series rather than a single datetime.datetime. Consider the following example:
import datetime
import pandas as pd
df = pd.DataFrame({"year":[2001,2002,2003]})
df["day"] = df["year"].apply(lambda x:datetime.datetime(x,9,30))
print(df)
Output:
   year        day
0  2001 2001-09-30
1  2002 2002-09-30
2  2003 2003-09-30
Here's some sample code with the required logic:
import pandas as pd
df = pd.DataFrame.from_dict({'date': ['2019-12-14', '2020-12-15']})
print(df.dtypes)
# convert the date in string format to datetime object,
# if the date column(Series) is already a datetime object then this is not required
df['date'] = pd.to_datetime(df['date'])
print(f'after conversion \n {df.dtypes}')
# logic to create a new data column
df['new_date'] = pd.to_datetime({'year':df['date'].dt.year,'month':9,'day':30})
#eollon I see that you are also new to Stack Overflow. It would be better if you added a simple code sample that others can try out independently
(keeping the comment here since I don't have permission to comment :) )
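Another vectorized option, sketched here under the assumption that the source column is already datetime-typed, is to build "YYYY-09-30" strings from the extracted year and parse them in a single call. The sample values are invented; only the column names datedaymonthyear and mydate come from the question.

```python
import pandas as pd

# Hypothetical sample data standing in for bbdd20Q3['datedaymonthyear'].
df = pd.DataFrame({"datedaymonthyear": pd.to_datetime(["2019-12-14", "2020-03-02"])})

# Turn each year into text, append the fixed month and day, then parse the whole column.
year_text = df["datedaymonthyear"].dt.year.astype(str)
df["mydate"] = pd.to_datetime(year_text + "-09-30", format="%Y-%m-%d")
print(df)
```

This avoids calling a Python function once per row, which tends to matter on large frames.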

python time stamp convert to datetime without a year specified

I have a CSV file with a year's worth of time series data where the timestamps look like the snippet below. One thing to mention about the data: it is 30-year averaged hourly weather data, so there isn't a year specified in the timestamp.
Date
01-01T01:00:00
01-01T02:00:00
01-01T03:00:00
01-01T04:00:00
01-01T05:00:00
01-01T06:00:00
01-01T07:00:00
01-01T08:00:00
01-01T09:00:00
01-01T10:00:00
01-01T11:00:00
01-01T12:00:00
01-01T13:00:00
01-01T14:00:00
01-01T15:00:00
01-01T16:00:00
01-01T17:00:00
01-01T18:00:00
01-01T19:00:00
01-01T20:00:00
01-01T21:00:00
01-01T22:00:00
01-01T23:00:00
I can read the csv file just fine:
df = pd.read_csv('weather_cleaned.csv', index_col='Date', parse_dates=True)
If I call pd.to_datetime(df), it raises an error:
ValueError: to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing
Would anyone have any tips to convert my df to datetime?
You can pass the date_parser argument (check the docs; note that pandas 2.0 deprecated it in favour of date_format), e.g.
import pandas as pd
from datetime import datetime
df = pd.read_csv('weather_cleaned.csv', index_col='Date', parse_dates=['Date'],
                 date_parser=lambda x: datetime.strptime(x, '%d-%mT%H:%M:%S'))
print(df.head())
Output:
Empty DataFrame
Columns: []
Index: [1900-01-01 01:00:00, 1900-01-01 02:00:00, 1900-01-01 03:00:00, 1900-01-01 04:00:00, 1900-01-01 05:00:00]
Of course, you can define a different function, for example to specify a different year.
E.g. if you want year 2020 instead of 1900, use
date_parser=lambda x: datetime.strptime(x, '%d-%mT%H:%M:%S').replace(year=2020)
Note that I assume the timestamps are in day-month format; change the format string accordingly.
EDIT: Changed my example to reflect that the Date column should be used as the index.
One thing you can do is to append a default year:
pd.to_datetime('2020-' + df['Date'])
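A self-contained sketch of the default-year approach is shown below. It assumes the timestamps are in month-day order (the format string would need to change for day-month), that 2020 is an acceptable placeholder year, and it uses an in-memory stand-in for weather_cleaned.csv with a made-up Temp column.

```python
import io
import pandas as pd

# Hypothetical stand-in for weather_cleaned.csv; the "Temp" column is invented.
csv_text = """Date,Temp
01-01T01:00:00,3.1
01-01T02:00:00,2.9
12-31T23:00:00,1.4
"""

# Read the Date column as plain text first, without any date parsing.
df = pd.read_csv(io.StringIO(csv_text))

# Prepend the placeholder year, then parse with an explicit month-day format.
df["Date"] = pd.to_datetime("2020-" + df["Date"], format="%Y-%m-%dT%H:%M:%S")

# Use the parsed timestamps as the index.
df = df.set_index("Date")
print(df.index)
```

Passing an explicit format keeps pandas from guessing between day-month and month-day order.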

Comparison between datetime and datetime64[ns] in pandas

I'm writing a program that checks an Excel file, and if today's date is in the file's date column, I parse it.
I'm using:
cur_date = datetime.today()
for today's date. I'm checking if today is in the column with:
bool_val = cur_date in df['date'] #evaluates to false
I know for a fact that today's date is in the file in question. The dtype of the series is datetime64[ns].
Also, I am only checking the date itself and not the time of day, if that matters. I'm doing this to set the time to 00:00:00:
cur_date = datetime.strptime(cur_date.strftime('%Y_%m_%d'), '%Y_%m_%d')
And the type of that object, when printed, is datetime as well.
For anyone who also stumbled across this when comparing a dataframe date to a variable date, and for whom the other answers did not quite fit, you can use the code below.
Instead of:
self.df["date"] = pd.to_datetime(self.df["date"])
You can import datetime and then add .dt.date to the end, so the column holds plain datetime.date objects:
self.df["date"] = pd.to_datetime(self.df["date"]).dt.date
You can use
pd.Timestamp('today')
or
pd.to_datetime('today')
But both of those give the date and time for 'now'.
Try this instead:
pd.Timestamp('today').floor('D')
or
pd.to_datetime('today').floor('D')
You could also have passed the datetime object to pandas.to_datetime, but I like the other option more.
pd.to_datetime(datetime.datetime.today()).floor('D')
Pandas also has a Timedelta object
pd.Timestamp('now').floor('D') + pd.Timedelta(-3, unit='D')
Or you can use the offsets module
pd.Timestamp('now').floor('D') + pd.offsets.Day(-3)
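The in operator on a Series checks the index labels, not the values, which is why the original test evaluates to False. To check for membership among the values, try one of these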
cur_date in df['date'].tolist()
Or
df['date'].eq(cur_date).any()
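The difference between testing the index and testing the values can be seen in a small sketch like the following. The sample dates are made up, and the Series uses the default integer index.

```python
import pandas as pd

# Made-up dates with the default integer index (0, 1).
dates = pd.Series(pd.to_datetime(["2022-11-19", "2022-11-20"]))
cur_date = pd.Timestamp("2022-11-20")

# `in` on a Series looks at the index labels, so the date is not found...
found_in_index = cur_date in dates
# ...while an index label such as 1 is.
label_in_index = 1 in dates

# These look at the values instead.
found_by_eq = bool(dates.eq(cur_date).any())
found_by_isin = bool(dates.isin([cur_date]).any())
found_in_list = cur_date in dates.tolist()

print(found_in_index, label_in_index, found_by_eq, found_by_isin, found_in_list)
```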
When converting a datetime64 value using pd.Timestamp(), note that you should compare it to another Timestamp (not to a datetime.date).
Convert a date to numpy.datetime64:
import numpy as np
import pandas as pd
from datetime import timedelta
date = '2022-11-20 00:00:00'
date64 = np.datetime64(date)
Seven days ago, as a Timestamp:
sevenDaysAgoTs = (pd.to_datetime('today')-timedelta(days=7))
Convert date64 to a Timestamp and see if it falls within the last 7 days:
print(pd.Timestamp(pd.to_datetime(date64)) >= sevenDaysAgoTs)

pandas difference between 2 dates

I am trying to find the day difference between today, and dates in my dataframe.
Below is my conversion of dates in my dataframe
df['Date']=pd.to_datetime(df['Date'])
Below is my code to get today
today1=dt.datetime.today().strftime('%Y-%m-%d')
today1=pd.to_datetime(today1)
Both have been converted with pd.to_datetime, but when I subtract them, the error below is raised.
ValueError: Cannot add integral value to Timestamp without offset.
Can someone help to advise? Thanks!
This is a simple example of how you can do this:
import pandas as pd
import datetime as dt
First, you have to get today's date.
today1=dt.datetime.today().strftime('%Y-%m-%d')
today1=pd.to_datetime(today1)
Then, you can construct the data frame:
df = pd.DataFrame({'Date': pd.to_datetime('2016-11-24 11:03:10.050000'), 'today1': today1 }, index = [0])
In this example I just have 2 columns, each with one value.
Next, you should check the data types:
print(df.dtypes)
Date      datetime64[ns]
today1    datetime64[ns]
If both data types are datetime64[ns], you can then subtract df.Date from df.today1.
print(df.today1 - df.Date)
The output:
0 19 days 12:56:49.950000
dtype: timedelta64[ns]
