I want to replace invalid values in my current data with corrected ones. My current data:
20000123
19850123
19880112
19951201
19850123
20190821
20000512
19850111
19670133
19850123
As you can see, there is a value 19670133 (YYYYMMDD), which is not a valid date since no month has 33 days. So I want to reassign it to the end of the month. Computing the end of the month on its own works.
But when I try to replace the old values with the new ones, I run into a problem.
This is what I have tried:
for x in df_tmp_customer['date']:
    try:
        df_tmp_customer['date'] = df_tmp_customer.apply(pd.to_datetime(x), axis=1)
    except Exception:
        df_tmp_customer['date'] = df_tmp_customer.apply(pd.to_datetime(x[0:6]+"01")+ pd.offsets.MonthEnd(n=0), axis=1)
This part is the one that makes it end of the month :
pd.to_datetime(x[0:6]+"01")+ pd.offsets.MonthEnd(n=0)
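One way to keep the try/except idea is to move it into a function that handles a single value and pass that function to Series.apply. This is only a minimal sketch: the function name to_date_or_month_end and the three-row sample frame are made up for illustration, and it assumes the column holds strings.

```python
import pandas as pd


def to_date_or_month_end(value: str) -> pd.Timestamp:
    """Parse YYYYMMDD; if the day is invalid, fall back to the end of that month."""
    try:
        return pd.to_datetime(value, format="%Y%m%d")
    except ValueError:
        # Take YYYYMM, start at day 01, then roll forward to the month end
        first_day = pd.to_datetime(value[:6] + "01", format="%Y%m%d")
        return first_day + pd.offsets.MonthEnd(n=0)


# Hypothetical sample with one invalid date (January 33rd)
df_tmp_customer = pd.DataFrame({"date": ["20000123", "19670133", "19850111"]})
df_tmp_customer["date"] = df_tmp_customer["date"].apply(to_date_or_month_end)
print(df_tmp_customer)
```

The function is passed to apply uncalled, so pandas invokes it once per value instead of receiving an already computed Timestamp.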
Probably not efficient on a large dataset but can be done using pendulum.parse()
import pendulum

def parse_dates(x: str) -> pendulum.Date:
    # Decrease the number by 1 until it parses as a valid date
    # (e.g. 19670133 -> 19670132 -> 19670131)
    i = 0
    while True:
        try:
            return pendulum.parse(str(int(x) - i)).date()
        except ValueError:
            i += 1

df["date"] = df["date"].apply(parse_dates)
print(df)
date
0 2000-01-23
1 1985-01-23
2 1988-01-12
3 1995-12-01
4 1985-01-23
5 2019-08-21
6 2000-05-12
7 1985-01-11
8 1967-01-31
9 1985-01-23
For a vectorized solution, you can use:
# try to convert to YYYYMMDD
date1 = pd.to_datetime(df['date'], format='%Y%m%d', errors='coerce')
# get rows for which conversion failed
m = date1.isna()
# try to get end of month
date2 = pd.to_datetime(df.loc[m, 'date'].str[:6], format='%Y%m', errors='coerce').add(pd.offsets.MonthEnd())
# Combine both
df['date2'] = date1.fillna(date2)
NB. Assuming df['date'] is of string dtype. If rather of integer dtype, use df.loc[m, 'date'].floordiv(100) in place of df.loc[m, 'date'].str[:6].
Output:
date date2
0 20000123 2000-01-23
1 19850123 1985-01-23
2 19880112 1988-01-12
3 19951201 1995-12-01
4 19850123 1985-01-23
5 20190821 2019-08-21
6 20000512 2000-05-12
7 19850111 1985-01-11
8 19670133 1967-01-31 # invalid replaced by end of month
9 19850123 1985-01-23
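If the column holds integers, another option is to cast it to strings once and then reuse the same steps. Note that this swaps the floordiv(100) suggestion for an astype(str) cast; it is a sketch on a made-up three-row frame, not a tested drop-in replacement.

```python
import pandas as pd

# Hypothetical integer-typed sample with one invalid date
df = pd.DataFrame({"date": [20000123, 19670133, 19850111]})

# Cast once to strings so the string-based steps apply unchanged
s = df["date"].astype(str)

# Valid YYYYMMDD values parse; invalid ones become NaT
date1 = pd.to_datetime(s, format="%Y%m%d", errors="coerce")
m = date1.isna()

# For failed rows, parse YYYYMM (first of month) and move to the month end
date2 = pd.to_datetime(s[m].str[:6], format="%Y%m", errors="coerce") + pd.offsets.MonthEnd()

df["date2"] = date1.fillna(date2)
print(df)
```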
Related
I am trying to convert an object (string) column to datetime. In the original dataframe the data type is a string and the dataset has shape = (28000000, 26). Importantly, the format of the date is MMYYYY only. Here's a data sample:
DATE
Out[3] 0 081972
1 051967
2 101964
3 041975
4 071976
I tried:
df['DATE'].apply(pd.to_datetime(format='%m%Y'))
and
pd.to_datetime(df['DATE'],format='%m%Y')
I got Runtime Error both times
Then
df['DATE'].apply(pd.to_datetime)
It worked for the other columns not shown here (which use DDMMYYYY format), but it generated future dates for df['DATE'] because it reads the values as MMDDYY instead of MMYYYY.
DATE
0 1972-08-19
1 2067-05-19
2 2064-10-19
3 1975-04-19
4 1976-07-19
Expected output:
DATE
0 1972-08
1 1967-05
2 1964-10
3 1975-04
4 1976-07
If this question is a duplicate please direct me to the original one, I wasn't able to find any suitable answer.
Thank you all in advance for your help
First, if an error is raised, some values evidently do not match the format. You can find them with the errors='coerce' parameter and Series.isna, because values that do not match are returned as missing values (NaT):
print (df)
DATE
0 81972
1 51967
2 101964
3 41975
4 171976 <-changed data
print (pd.to_datetime(df['DATE'],format='%m%Y', errors='coerce'))
0 1972-08-01
1 1967-05-01
2 1964-10-01
3 1975-04-01
4 NaT
Name: DATE, dtype: datetime64[ns]
print (df[pd.to_datetime(df['DATE'],format='%m%Y', errors='coerce').isna()])
DATE
4 171976
Solution for the changed data: convert to datetimes, then to monthly periods with Series.dt.to_period:
df['DATE'] = pd.to_datetime(df['DATE'],format='%m%Y', errors='coerce').dt.to_period('M')
print (df)
DATE
0 1972-08
1 1967-05
2 1964-10
3 1975-04
4 NaT
Solution with the original data:
df['DATE'] = pd.to_datetime(df['DATE'],format='%m%Y', errors='coerce').dt.to_period('M')
print (df)
0 1972-08
1 1967-05
2 1964-10
3 1975-04
4 1976-07
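The printed sample above shows values such as 81972 without the leading zero. If that happens because the column was read as integers (an assumption, the question says the column is a string), padding back to six characters before parsing keeps the MMYYYY layout unambiguous. A small sketch on made-up data:

```python
import pandas as pd

# Hypothetical integer column where leading zeros were lost on read
df = pd.DataFrame({"DATE": [81972, 51967, 101964]})

# Restore the six-character MMYYYY layout: 81972 -> "081972"
padded = df["DATE"].astype(str).str.zfill(6)

# Parse as month + year, then keep only the monthly period
df["DATE"] = pd.to_datetime(padded, format="%m%Y", errors="coerce").dt.to_period("M")
print(df)
```

Reading the file with dtype=str for that column avoids losing the zeros in the first place.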
I would have done:
df['date_formatted'] = pd.to_datetime(
    dict(
        year=df['DATE'].str[2:],
        month=df['DATE'].str[:2],
        day=1
    )
)
Maybe this helps. Works for your sample data.
Comparing today's date with dates in a dataframe
Sample Data
id date
1 1/2/2018
2 1/5/2019
3 5/3/2018
4 23/11/2018
Desired output
id date
2 1/5/2019
4 23/11/2018
My current code
dfdateList = pd.DataFrame()
dfDate= self.df[["id", "date"]]
today = datetime.datetime.now()
today = today.strftime("%d/%m/%Y").lstrip("0").replace(" 0", "")
expList = []
for dates in dfDate["date"]:
    if dates <= today:
        expList.append(dates)
dfdateList = pd.DataFrame(expList)
Currently my code prints every row regardless of the condition. Can anyone guide me? Thanks.
Pandas has native support for a large class of operations on datetimes, so one solution here would be to use pd.to_datetime to convert your dates from strings to pandas' representation of datetimes, pd.Timestamp, then just create a mask based on the current date:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df[df['date'] > pd.Timestamp.now()]
For example:
In [34]: df['date'] = pd.to_datetime(df['date'], dayfirst=True)
In [36]: df
Out[36]:
id date
0 1 2018-02-01
1 2 2019-05-01
2 3 2018-03-05
3 4 2018-11-23
In [37]: df[df['date'] > pd.Timestamp.now()]
Out[37]:
id date
1 2 2019-05-01
3 4 2018-11-23
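A likely reason the original loop misbehaves is that dates <= today compares strings, and strings are ordered character by character rather than by calendar date. The sketch below shows the difference; it uses a fixed reference date (a stand-in for "today", chosen only so the result is reproducible) and an explicit day-first format.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "date": ["1/2/2018", "1/5/2019", "5/3/2018", "23/11/2018"],
})

# Lexicographic comparison: "2" sorts before "5", so this is True
# even though 23 November 2018 is later than 5 March 2018
string_result = "23/11/2018" < "5/3/2018"

# Parse the strings as day/month/year, then compare real timestamps
parsed = pd.to_datetime(df["date"], format="%d/%m/%Y")
reference = pd.Timestamp("2018-06-01")  # hypothetical stand-in for today's date
later = df[parsed > reference]
print(later)
```

With the reference date above, only ids 2 and 4 remain, which matches the desired output in the question.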
I would like to convert the data type float below to datetime format:
df
Date
0 NaN
1 NaN
2 201708.0
4 201709.0
5 201700.0
6 201600.0
Name: Cred_Act_LstPostDt_U324123, dtype: float64
pd.to_datetime(df['Date'],format='%Y%m.0')
ValueError: time data 201700.0 does not match format '%Y%m.0' (match)
How can I convert the rows without month information, using yyyy01 as the default?
You can use str.replace on each value to clean up your month data:
s = [x.replace('00.0', '01.0') for x in df['Date'].astype(str)]
df['Date'] = pd.to_datetime(s, format='%Y%m.0', errors='coerce')
print(df)
Date
0 NaT
1 NaT
2 2017-08-01
4 2017-09-01
5 2017-01-01
6 2016-01-01
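The same clean-up can also be done numerically instead of on strings. This is a sketch under the assumption that every non-missing value is a whole number of the form YYYYMM; the four-row frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the question's float column
df = pd.DataFrame({"Date": [np.nan, 201708.0, 201700.0, 201600.0]})

# Work only on the non-missing rows, as plain integers (201708.0 -> 201708)
valid = df["Date"].dropna().astype(int)

# Where the month part is 00, bump it to 01; keep other values unchanged
valid = valid.where(valid % 100 != 0, valid + 1)

# Parse YYYYMM, then align back to the full index so missing rows become NaT
parsed = pd.to_datetime(valid.astype(str), format="%Y%m")
df["Date"] = parsed.reindex(df.index)
print(df)
```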
Convert the floats to strings with .astype(str), split each string after the fourth character, and use str.cat to insert a hyphen between the year and the month. You can then parse the result with format='%Y-%m'.
However, this still cannot parse invalid month numbers such as month 00; with errors='coerce' those rows become NaT instead of raising.
string = df['Date'].astype(str)
# join the year part and the month part with a hyphen, e.g. '2017-08'
date = string.str[:4].str.cat(string.str[4:6], sep='-')
pd.to_datetime(date, format='%Y-%m', errors='coerce')
I have 2 columns:
dt_year, dt_month
2014 3
I need a date column.
I tried something like:
pd.to_datetime((df.dt_year + df.dt_month +1).apply(str),format='%Y%m%d')
But I get an error:
ValueError: time data '2014' does not match format '%Y%m%d' (match)
Any ideas?
First, rename the columns to year and month, the names pd.to_datetime expects when it assembles dates from a DataFrame. Then add a 'day' column.
df.columns = df.columns.str.replace('dt_', '')
df['day'] = 1
df
year month day
0 2014 3 1
Then the magic happens
pd.to_datetime(df)
0 2014-03-01
dtype: datetime64[ns]
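If you would rather keep the original column names, a temporary frame with the expected names can be built on the fly. A minimal sketch; the output column name date is chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame({"dt_year": [2014], "dt_month": [3]})

# Map the existing columns onto the names to_datetime looks for;
# the scalar day=1 is broadcast to every row
parts = pd.DataFrame({"year": df["dt_year"], "month": df["dt_month"], "day": 1})
df["date"] = pd.to_datetime(parts)
print(df)
```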
I have a file with two different dates: one has a timestamp and one does not. I need to read the file, disregard the timestamp, and compare the two dates. If the two dates are the same then I need to spit it to the output file and disregard any other rows.
I'm having trouble knowing if I should be using a datetime function on the input and formatting the date there and then simply seeing if the two are equivalent? Or should I be using a timedelta?
I've tried a couple different ways but haven't had success.
df = pd.read_csv("File.csv", dtype={'DATETIMESTAMP': np.datetime64, 'DATE':np.datetime64})
Gives me: TypeError: the dtype <M8 is not supported for parsing, pass this column using parse_dates instead
I've also tried to just remove the timestamp and then compare, but the strings end up with different date formats and that doesn't work either.
df['RemoveTimestamp'] = df['DATETIMESTAMP'].apply(lambda x: x[:10])
df = df[df['RemoveTimestamp'] == df['DATE']]
Any guidance appreciated.
Here is my sample input CSV file:
"DATE", "DATETIMESTAMP"
"8/6/2014","2014-08-06T10:18:38.000Z"
"1/15/2013","2013-01-15T08:57:38.000Z"
"3/7/2013","2013-03-07T16:57:18.000Z"
"12/4/2012","2012-12-04T10:59:37.000Z"
"5/6/2014","2014-05-06T11:07:46.000Z"
"2/13/2013","2013-02-13T15:51:42.000Z"
import pandas as pd
import numpy as np
# your data, both columns are in string
# ================================================
df = pd.read_csv('sample_data.csv')
df
DATE DATETIMESTAMP
0 8/6/2014 2014-08-06T10:18:38.000Z
1 1/15/2013 2013-01-15T08:57:38.000Z
2 3/7/2013 2013-03-07T16:57:18.000Z
3 12/4/2012 2012-12-04T10:59:37.000Z
4 5/6/2014 2014-05-06T11:07:46.000Z
5 2/13/2013 2013-02-13T15:51:42.000Z
# processing
# =================================================
# convert string to datetime
df['DATE'] = pd.to_datetime(df['DATE'])
df['DATETIMESTAMP'] = pd.to_datetime(df['DATETIMESTAMP'])
# cast timestamp to date
df['DATETIMESTAMP'] = df['DATETIMESTAMP'].values.astype('<M8[D]')
DATE DATETIMESTAMP
0 2014-08-06 2014-08-06
1 2013-01-15 2013-01-15
2 2013-03-07 2013-03-07
3 2012-12-04 2012-12-04
4 2014-05-06 2014-05-06
5 2013-02-13 2013-02-13
# compare
df['DATE'] == df['DATETIMESTAMP']
0 True
1 True
2 True
3 True
4 True
5 True
dtype: bool
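To keep only the matching rows, as the question asks, the comparison can be used as a filter. The sketch below compares plain calendar dates via .dt.date, which sidesteps the timezone marker (Z) on the timestamp column. The inline CSV is a shortened, made-up variant of the sample in which the second row deliberately differs, and the dates are taken in UTC, which is an assumption about what "same date" should mean.

```python
import io

import pandas as pd

# Hypothetical two-row sample; the second row's dates differ on purpose
csv_text = '''"DATE","DATETIMESTAMP"
"8/6/2014","2014-08-06T10:18:38.000Z"
"1/15/2013","2013-01-16T08:57:38.000Z"
'''
df = pd.read_csv(io.StringIO(csv_text))

# Reduce both columns to calendar dates (datetime.date objects)
date_only = pd.to_datetime(df["DATE"], format="%m/%d/%Y").dt.date
stamp_date = pd.to_datetime(df["DATETIMESTAMP"], utc=True).dt.date

# Keep the original rows where the two dates agree
matches = df[date_only == stamp_date]
print(matches)
```

From there, matches.to_csv(..., index=False) with an output path of your choice writes the result to a file.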
How about:
import time
filename = 'dates.csv'
with open(filename) as f:
    next(f)  # skip the header row
    for line in f:
        date1, date2 = line.split(',')
        date1 = date1.strip('"')
        date2 = date2.split('T')[0].strip('"')
        # convert M/D/YYYY to YYYY-MM-DD so both dates compare as equal strings
        date1a = time.strftime("%Y-%m-%d", time.strptime(date1, "%m/%d/%Y"))
        if date1a == date2:
            print(line, end='')