Suppose we had the following dataframe.
How can I create the fourth column 'Invalid dates' as specified below using the first three columns in the dataframe?
Name Date1 Date2 Invalid dates
0 A 01-02-2022 03-04-2000 None
1 B 23 12-12-2012 Date1
2 C 18-04-1993 abc Date2
3 D 45 qcf Date1, Date2
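For reference, the frame above can be reproduced with all values kept as plain strings (a minimal sketch, so the snippets below can be run as-is):
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'],
                   'Date1': ['01-02-2022', '23', '18-04-1993', '45'],
                   'Date2': ['03-04-2000', '12-12-2012', 'abc', 'qcf']})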
You can select the Date columns with filter (or any other method, including a manual list), flag the invalid dates by converting with to_datetime and marking the NaN values with isna, then stack to long format and join the offending column names back to the original DataFrame:
s = (df
.filter(like='Date') # keep only "Date" columns
# convert to datetime, NaT will be invalid dates
.apply(lambda s: pd.to_datetime(s, format='%d-%m-%Y', errors='coerce'))
.isna()
# reshape to long format (Series)
.stack()
)
out = (df
.join(s[s].reset_index(level=1) # keep only invalid dates
.groupby(level=0)['level_1'] # for all initial indices
.agg(','.join) # join the column names
.rename('Invalid Dates')
)
)
An alternative with melt to reshape the DataFrame:
cols = df.filter(like='Date').columns
out = df.merge(
df.melt(id_vars='Name', value_vars=cols, var_name='Invalid Dates')
.assign(value=lambda d: pd.to_datetime(d['value'], format='%d-%m-%Y',
errors='coerce'))
.loc[lambda d: d['value'].isna()]
.groupby('Name')['Invalid Dates'].agg(','.join),
left_on='Name', right_index=True, how='left'
)
Output:
Name Date1 Date2 Invalid Dates
0 A 01-02-2022 03-04-2000 NaN
1 B 23 12-12-2012 Date1
2 C 18-04-1993 abc Date2
3 D 45 qcf Date1,Date2
Use DataFrame.filter to keep the columns whose names contain the substring Date, convert all columns of df1 to datetimes with to_datetime and errors='coerce' so non-matching values become missing, then test them with DataFrame.isna and extract the matching column names separated by , with DataFrame.dot:
import numpy as np

df1 = df.filter(like='Date')
df['Invalid dates'] = ((df1.apply(lambda x: pd.to_datetime(x, format='%d-%m-%Y', errors='coerce'))
                        .isna() & df1.notna())   # invalid, but not missing in the original
                       .dot(df1.columns + ',')   # collect the offending column names
                       .str[:-1]                 # drop the trailing comma
                       .replace('', np.nan))     # rows with no invalid dates -> NaN
print (df)
Name Date1 Date2 Invalid dates
0 A 01-02-2022 03-04-2000 NaN
1 B 23 12-12-2012 Date1
2 C 18-04-1993 abc Date2
3 D 45 qcf Date1,Date2
Related
I have two data frames.
Date thing
201712.0 1
201801.0 2
The Date column is float64 type and I am trying to convert it to date of 12/1/2017 and 1/1/2018 respectively.
Date thing2
12/16/2017 2
1/16/2018 3
The Date column here is object type and I hope to convert to 12/1/2017 and 1/1/2018 as well. The idea here is to do a pd.merge after.
You need:
df['Date'] = (pd.to_datetime(df['Date'].astype(int).astype(str), format='%Y%m')
              + pd.Timedelta(days=15))  # 1st of the month + 15 days -> the 16th, matching df2
Output:
Date thing
0 2017-12-16 1
1 2018-01-16 2
Using pandas.to_datetime to convert the 'Date' columns of your original dataframes:
df1 = pd.DataFrame([[201712.0, 1], [201801.0, 2]], columns=["Date", "thing"])
df2 = pd.DataFrame([["12/16/2017", 2], ["1/16/2018", 3]], columns=["Date", "thing2"])
df1['Date'] = pd.to_datetime(df1['Date'].astype(str), format='%Y%m.0')
df2['Date'] = pd.to_datetime(df2['Date']).apply(lambda x : x.replace(day=1))
In the first dataframe, the 'Date' column is converted to string (the .astype(str) call) so that a format string can be used.
In the second dataframe, apply is used to reset the day of the month to the first, whatever it was originally.
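With both 'Date' columns normalized to the first of the month, the pd.merge the question is aiming for becomes a one-liner (a sketch, reusing the df1 and df2 from the snippet above):
out = df1.merge(df2, on='Date')
print(out)
#         Date  thing  thing2
# 0 2017-12-01      1       2
# 1 2018-01-01      2       3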
I have 2 dataframes, loaded with pd.read_csv. I create a dataframe like this:
df1= pd.read_csv('exo.csv', delimiter=';', encoding='latin1', parse_dates=['date'], dayfirst=True)
The 2 dataframes are:
df1:
date number
jan-16
feb-17
march-17
april-17
df2:
date
09/01/2016
08/02/2017
15/02/2017
13/03/2017
25/08/2017
I would like to check if each value of df1.date exists in df2.date. If yes, the column df1['number'] should count the number of appearances. The result for df1 should then look like this:
date number
jan-16 1
feb-17 2 (for instance, feb-17 is found 2 times in df2['date'])
How can I do this? Do I need to change the date format?
Thanks in advance.
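For reference, the two frames can be approximated like this (a minimal sketch; df1's dates are kept as the plain strings shown in the table, and number is the column we want to compute):
import pandas as pd

df1 = pd.DataFrame({'date': ['jan-16', 'feb-17', 'march-17', 'april-17']})
df2 = pd.DataFrame({'date': ['09/01/2016', '08/02/2017', '15/02/2017',
                             '13/03/2017', '25/08/2017']})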
You need to group df2 by date and count, then merge the result into df1 on 'date1':
# parse the dates, then format them as lowercase mon-yy to match df1
df2['date2'] = pd.to_datetime(df2['date'], format='%d/%m/%Y')
df2['date1'] = df2.date2.dt.strftime('%b-%y').astype(str).str.lower()
# count the occurrences of each month
b = pd.DataFrame(df2.groupby('date1')['date'].count())
b.columns = ['number']
b = b.reset_index()
Then merge:
df1['date'] = df1.date.str.lower()
df1.merge(b, right_on='date1', left_on='date', how='left')
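A more compact variant of the same idea (a sketch, not a drop-in replacement): format df2's dates as lowercase mon-yy strings, count them with value_counts, and map the counts onto df1. As with the merge above, the abbreviated month names produced by strftime ('mar-17', 'apr-17') have to match the spelling used in df1.
counts = (pd.to_datetime(df2['date'], format='%d/%m/%Y')
            .dt.strftime('%b-%y')
            .str.lower()
            .value_counts())
df1['number'] = df1['date'].str.lower().map(counts).fillna(0).astype(int)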
I currently have a dataframe which looks like this
User Date FeatureA FeatureB
John DateA 1 2
John DateB 3 5
Is there any way that I can combine the 2 rows such that it becomes
User Date1 Date2 FeatureA1 FeatureB1 FeatureA2 FeatureB2
John DateA DateB 1 2 3 5
I think you need:
g = df.groupby(['User']).cumcount()
df = df.set_index(['User', g]).unstack()
df.columns = ['{}{}'.format(i, j+1) for i, j in df.columns]
df = df.reset_index()
print (df)
User Date1 Date2 FeatureA1 FeatureA2 FeatureB1 FeatureB2
0 John DateA DateB 1 3 2 5
Explanation:
Get a counter per User group with cumcount
Create a MultiIndex by set_index
Reshape by unstack
Flatten the MultiIndex in columns
Convert the index back to columns by reset_index
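For completeness, the input used above can be rebuilt from the question's table like this (Date values kept as plain strings):
df = pd.DataFrame({'User': ['John', 'John'],
                   'Date': ['DateA', 'DateB'],
                   'FeatureA': [1, 3],
                   'FeatureB': [2, 5]})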
I have dates in a DataFrame's column like:
1 06AUG2010
2 07APR2011
I want to convert them to a type where I can compute differences between dates in days.
I'm searching the internet for the answer, but can't find it. New to pandas.
You can use to_datetime with custom format:
df = pd.DataFrame({'date':['06AUG2010','07APR2011']}, index=[1,2])
print (df)
date
1 06AUG2010
2 07APR2011
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y')
print (df)
date
1 2010-08-06
2 2011-04-07
And then for the differences use diff:
df['date'] = df['date'].diff()
print (df)
date
1 NaT
2 244 days
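If you need the differences as plain numbers of days rather than Timedeltas, .dt.days can be chained onto the diff (a small sketch, computed from the converted datetime column rather than the already-overwritten one):
days = pd.to_datetime(pd.Series(['06AUG2010', '07APR2011'], index=[1, 2]),
                      format='%d%b%Y').diff().dt.days
print(days)
# 1      NaN
# 2    244.0
# dtype: float64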
I have a pd.DataFrame that looks like the one below:
Start Date End Date
1/1/1990 7/1/2014
7/1/2005 5/1/2013
8/1/1997 8/1/2004
9/1/2001
I'd like to capture, in a DatetimeIndex, how many items had started but not yet ended as of certain months. What I want it to look like is illustrated below.
Date Count
4/1/2013 3
5/1/2013 2
6/1/2013 2
7/1/2013 2
So far I have created a series keyed by a string combining the start and end dates, which sums up all items with the same start and end dates.
1/1/19907/1/2014 1
7/1/20055/1/2013 1
8/1/19978/1/2004 1
9/1/2001 1
And I have a dataframe with the datetimeindex looking as follows:
4/1/2013
5/1/2013
6/1/2013
7/1/2013
Now I'm struggling to combine the two to get what I'm looking for. I'm probably thinking about this all wrong and was looking for better ideas.
You can try:
print(df1)
Start Date End Date
0 1/1/1990 7/1/2014
1 7/1/2005 5/1/2013
2 8/1/1997 8/1/2004
3 9/1/2001 NaN
print(df2)
Index: [4/1/2013, 5/1/2013, 6/1/2013, 7/1/2013]
#drop rows with missing values in columns Start Date, End Date
df1 = df1.dropna(subset=['Start Date','End Date'])
#convert columns to datetime and then to month period
df1['Start Date'] = pd.to_datetime(df1['Start Date']).dt.to_period('M')
df1['End Date'] = pd.to_datetime(df1['End Date']).dt.to_period('M')
#create new column from datetimeindex and convert it to month period
df2['Date'] = pd.DatetimeIndex(df2.index).to_period('M')
print(df1)
Start Date End Date
0 1990-01 2014-07
1 2005-07 2013-05
2 1997-08 2004-08
print(df2)
Date
Date
4/1/2013 2013-04
5/1/2013 2013-05
6/1/2013 2013-06
7/1/2013 2013-07
#stack data for resampling
df1 = df1.stack().reset_index(drop=True, level=1).reset_index(name='Date')
print(df1)
index Date
0 0 1990-01
1 0 2014-07
2 1 2005-07
3 1 2013-05
4 2 1997-08
5 2 2004-08
#resample by column index (one row per month between each start and end)
df = (df1.groupby(df1['index'])
         .apply(lambda x: x.set_index('Date').resample('M').first())
         .reset_index(level=1))
#remove unnecessary column index
df = df.drop('index', axis=1)
print(df.head())
Date
index
0 1990-01
0 1990-02
0 1990-03
0 1990-04
0 1990-05
#merge df and df2 on column Date, group by Date and count
print(pd.merge(df, df2, on='Date').groupby('Date')['Date'].count())
Date
2013-04 2
2013-05 2
2013-06 1
2013-07 1
Freq: M, Name: Date, dtype: int64
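An alternative that avoids the resampling entirely: for each month in df2's index, count the rows of df1 whose Start Date falls on or before it and whose End Date is either missing or later (a sketch against the original frames from the question; because the open-ended row is kept, it reproduces the counts the question asks for rather than the result above):
start = pd.to_datetime(df1['Start Date'])
end = pd.to_datetime(df1['End Date'])
months = pd.to_datetime(df2.index)

counts = pd.Series([((start <= m) & (end.isna() | (end > m))).sum() for m in months],
                   index=df2.index, name='Count')
print(counts)
# 4/1/2013    3
# 5/1/2013    2
# 6/1/2013    2
# 7/1/2013    2
# Name: Count, dtype: int64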