Suppose we had the following dataframe.
How can I create the fourth column 'Invalid dates' as specified below using the first three columns in the dataframe?
Name Date1 Date2 Invalid dates
0 A 01-02-2022 03-04-2000 None
1 B 23 12-12-2012 Date1
2 C 18-04-1993 abc Date2
3 D 45 qcf Date1, Date2
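For reference, the frame above can be reproduced with all values kept as plain strings (a minimal sketch, so the snippets below can be run as-is):
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'],
                   'Date1': ['01-02-2022', '23', '18-04-1993', '45'],
                   'Date2': ['03-04-2000', '12-12-2012', 'abc', 'qcf']})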
You can select the Date columns with filter (or any other method, including a manual list), flag the invalid dates by converting with to_datetime and marking the NaN values with isna, then stack to long format and join the offending column names back to the original DataFrame:
s = (df
.filter(like='Date') # keep only "Date" columns
# convert to datetime, NaT will be invalid dates
.apply(lambda s: pd.to_datetime(s, format='%d-%m-%Y', errors='coerce'))
.isna()
# reshape to long format (Series)
.stack()
)
out = (df
.join(s[s].reset_index(level=1) # keep only invalid dates
.groupby(level=0)['level_1'] # for all initial indices
.agg(','.join) # join the column names
.rename('Invalid Dates')
)
)
An alternative with melt to reshape the DataFrame:
cols = df.filter(like='Date').columns
out = df.merge(
df.melt(id_vars='Name', value_vars=cols, var_name='Invalid Dates')
.assign(value=lambda d: pd.to_datetime(d['value'], format='%d-%m-%Y',
errors='coerce'))
.loc[lambda d: d['value'].isna()]
.groupby('Name')['Invalid Dates'].agg(','.join),
left_on='Name', right_index=True, how='left'
)
Output:
Name Date1 Date2 Invalid Dates
0 A 01-02-2022 03-04-2000 NaN
1 B 23 12-12-2012 Date1
2 C 18-04-1993 abc Date2
3 D 45 qcf Date1,Date2
Use DataFrame.filter to keep the columns whose names contain the substring Date, convert all columns of df1 to datetimes with to_datetime and errors='coerce' so non-matching values become missing, then test them with DataFrame.isna and extract the matching column names separated by , with DataFrame.dot:
import numpy as np

df1 = df.filter(like='Date')
df['Invalid dates'] = ((df1.apply(lambda x: pd.to_datetime(x, format='%d-%m-%Y', errors='coerce'))
                        .isna() & df1.notna())   # invalid, but not missing in the original
                       .dot(df1.columns + ',')   # collect the offending column names
                       .str[:-1]                 # drop the trailing comma
                       .replace('', np.nan))     # rows with no invalid dates -> NaN
print (df)
Name Date1 Date2 Invalid dates
0 A 01-02-2022 03-04-2000 NaN
1 B 23 12-12-2012 Date1
2 C 18-04-1993 abc Date2
3 D 45 qcf Date1,Date2
Related
I have two data frames.
Date thing
201712.0 1
201801.0 2
The Date column is float64 type and I am trying to convert it to date of 12/1/2017 and 1/1/2018 respectively.
Date thing2
12/16/2017 2
1/16/2018 3
The Date column here is object type and I hope to convert to 12/1/2017 and 1/1/2018 as well. The idea here is to do a pd.merge after.
You need:
df['Date'] = (pd.to_datetime(df['Date'].astype(int).astype(str), format='%Y%m')
              + pd.Timedelta(days=15))  # 1st of the month + 15 days -> the 16th, matching df2
Output:
Date thing
0 2017-12-16 1
1 2018-01-16 2
Using pandas.to_datetime to convert the 'Date' columns of your original dataframes:
df1 = pd.DataFrame([[201712.0, 1], [201801.0, 2]], columns=["Date", "thing"])
df2 = pd.DataFrame([["12/16/2017", 2], ["1/16/2018", 3]], columns=["Date", "thing2"])
df1['Date'] = pd.to_datetime(df1['Date'].astype(str), format='%Y%m.0')
df2['Date'] = pd.to_datetime(df2['Date']).apply(lambda x : x.replace(day=1))
In the first dataframe, the 'Date' column is converted to string (the .astype(str) call) so that a format string can be used.
In the second dataframe, apply is used to reset the day of the month to the first, whatever it was originally.
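With both 'Date' columns normalized to the first of the month, the pd.merge the question is aiming for becomes a one-liner (a sketch, reusing the df1 and df2 from the snippet above):
out = df1.merge(df2, on='Date')
print(out)
#         Date  thing  thing2
# 0 2017-12-01      1       2
# 1 2018-01-01      2       3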
I have 2 dataframes, loaded with pd.read_csv. I create a dataframe like this:
df1= pd.read_csv('exo.csv', delimiter=';', encoding='latin1', parse_dates=['date'], dayfirst=True)
The 2 dataframes are:
df1:
date number
jan-16
feb-17
march-17
april-17
df2:
date
09/01/2016
08/02/2017
15/02/2017
13/03/2017
25/08/2017
I would like to check if each value of df1.date exists in df2.date. If yes, the column df1['number'] should count the number of appearances. The result for df1 should then look like this:
date number
jan-16 1
feb-17 2 (for instance, feb-17 is found 2 times in df2['date'])
How can I do this? Do I need to change the date format?
Thanks in advance.
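For reference, the two frames can be approximated like this (a minimal sketch; df1's dates are kept as the plain strings shown in the table, and number is the column we want to compute):
import pandas as pd

df1 = pd.DataFrame({'date': ['jan-16', 'feb-17', 'march-17', 'april-17']})
df2 = pd.DataFrame({'date': ['09/01/2016', '08/02/2017', '15/02/2017',
                             '13/03/2017', '25/08/2017']})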
You need to group df2 by date and count, then merge the result into df1 on 'date1':
# parse the dates, then format them as lowercase mon-yy to match df1
df2['date2'] = pd.to_datetime(df2['date'], format='%d/%m/%Y')
df2['date1'] = df2.date2.dt.strftime('%b-%y').astype(str).str.lower()
# count the occurrences of each month
b = pd.DataFrame(df2.groupby('date1')['date'].count())
b.columns = ['number']
b = b.reset_index()
Then merge:
df1['date'] = df1.date.str.lower()
df1.merge(b, right_on='date1', left_on='date', how='left')
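A more compact variant of the same idea (a sketch, not a drop-in replacement): format df2's dates as lowercase mon-yy strings, count them with value_counts, and map the counts onto df1. As with the merge above, the abbreviated month names produced by strftime ('mar-17', 'apr-17') have to match the spelling used in df1.
counts = (pd.to_datetime(df2['date'], format='%d/%m/%Y')
            .dt.strftime('%b-%y')
            .str.lower()
            .value_counts())
df1['number'] = df1['date'].str.lower().map(counts).fillna(0).astype(int)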
I currently have a dataframe which looks like this
User Date FeatureA FeatureB
John DateA 1 2
John DateB 3 5
Is there any way that I can combine the 2 rows such that it becomes
User Date1 Date2 FeatureA1 FeatureB1 FeatureA2 FeatureB2
John DateA DateB 1 2 3 5
I think you need:
g = df.groupby(['User']).cumcount()
df = df.set_index(['User', g]).unstack()
df.columns = ['{}{}'.format(i, j+1) for i, j in df.columns]
df = df.reset_index()
print (df)
User Date1 Date2 FeatureA1 FeatureA2 FeatureB1 FeatureB2
0 John DateA DateB 1 3 2 5
Explanation:
Get a counter per User group with cumcount
Create a MultiIndex by set_index
Reshape by unstack
Flatten the MultiIndex in columns
Convert the index back to columns by reset_index
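For completeness, the input used above can be rebuilt from the question's table like this (Date values kept as plain strings):
df = pd.DataFrame({'User': ['John', 'John'],
                   'Date': ['DateA', 'DateB'],
                   'FeatureA': [1, 3],
                   'FeatureB': [2, 5]})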
I have dates in a DataFrame's column like:
1 06AUG2010
2 07APR2011
I want to convert them to a type where I can compute differences between dates in days.
I'm searching the internet for the answer, but can't find it. New to pandas.
You can use to_datetime with custom format:
df = pd.DataFrame({'date':['06AUG2010','07APR2011']}, index=[1,2])
print (df)
date
1 06AUG2010
2 07APR2011
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y')
print (df)
date
1 2010-08-06
2 2011-04-07
And then for the differences use diff:
df['date'] = df['date'].diff()
print (df)
date
1 NaT
2 244 days
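If you need the differences as plain numbers of days rather than Timedeltas, .dt.days can be chained onto the diff (a small sketch, computed from the converted datetime column rather than the already-overwritten one):
days = pd.to_datetime(pd.Series(['06AUG2010', '07APR2011'], index=[1, 2]),
                      format='%d%b%Y').diff().dt.days
print(days)
# 1      NaN
# 2    244.0
# dtype: float64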
I have a pd.DataFrame that looks like the one below:
Start Date End Date
1/1/1990 7/1/2014
7/1/2005 5/1/2013
8/1/1997 8/1/2004
9/1/2001
I'd like to capture, in a DatetimeIndex, how many items had started but not yet ended as of certain months. What I want it to look like is illustrated below.
Date Count
4/1/2013 3
5/1/2013 2
6/1/2013 2
7/1/2013 2
So far I have created a series keyed by a string combining the start and end dates, which sums up all items with the same start and end dates.
1/1/19907/1/2014 1
7/1/20055/1/2013 1
8/1/19978/1/2004 1
9/1/2001 1
And I have a dataframe with the datetimeindex looking as follows:
4/1/2013
5/1/2013
6/1/2013
7/1/2013
Now I'm struggling to combine the two to get what I'm looking for. I'm probably thinking about this all wrong and was looking for better ideas.
You can try:
print(df1)
Start Date End Date
0 1/1/1990 7/1/2014
1 7/1/2005 5/1/2013
2 8/1/1997 8/1/2004
3 9/1/2001 NaN
print(df2)
Index: [4/1/2013, 5/1/2013, 6/1/2013, 7/1/2013]
#drop rows with missing values in columns Start Date, End Date
df1 = df1.dropna(subset=['Start Date','End Date'])
#convert columns to datetime and then to month period
df1['Start Date'] = pd.to_datetime(df1['Start Date']).dt.to_period('M')
df1['End Date'] = pd.to_datetime(df1['End Date']).dt.to_period('M')
#create new column from datetimeindex and convert it to month period
df2['Date'] = pd.DatetimeIndex(df2.index).to_period('M')
print(df1)
Start Date End Date
0 1990-01 2014-07
1 2005-07 2013-05
2 1997-08 2004-08
print(df2)
Date
Date
4/1/2013 2013-04
5/1/2013 2013-05
6/1/2013 2013-06
7/1/2013 2013-07
#stack data for resampling
df1 = df1.stack().reset_index(drop=True, level=1).reset_index(name='Date')
print(df1)
index Date
0 0 1990-01
1 0 2014-07
2 1 2005-07
3 1 2013-05
4 2 1997-08
5 2 2004-08
#resample by column index (one row per month between each start and end)
df = (df1.groupby(df1['index'])
         .apply(lambda x: x.set_index('Date').resample('M').first())
         .reset_index(level=1))
#remove unnecessary column index
df = df.drop('index', axis=1)
print(df.head())
Date
index
0 1990-01
0 1990-02
0 1990-03
0 1990-04
0 1990-05
#merge df and df2 on column Date, group by Date and count
print(pd.merge(df, df2, on='Date').groupby('Date')['Date'].count())
Date
2013-04 2
2013-05 2
2013-06 1
2013-07 1
Freq: M, Name: Date, dtype: int64
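An alternative that avoids the resampling entirely: for each month in df2's index, count the rows of df1 whose Start Date falls on or before it and whose End Date is either missing or later (a sketch against the original frames from the question; because the open-ended row is kept, it reproduces the counts the question asks for rather than the result above):
start = pd.to_datetime(df1['Start Date'])
end = pd.to_datetime(df1['End Date'])
months = pd.to_datetime(df2.index)

counts = pd.Series([((start <= m) & (end.isna() | (end > m))).sum() for m in months],
                   index=df2.index, name='Count')
print(counts)
# 4/1/2013    3
# 5/1/2013    2
# 6/1/2013    2
# 7/1/2013    2
# Name: Count, dtype: int64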