I have a dataframe with datetime index. First of all, here is my fake data.
import pandas as pd
data1 = {'date': ['20190219 093100', '20190219 103200', '20190219 171200',
                  '20190219 193900', '20190219 194500', '20190220 093500',
                  '20190220 093600'],
         'number': [18.6125, 12.85, 14.89, 15.8301, 15.85, 14.916, 14.95]}
df1 = pd.DataFrame(data1)
df1 = df1.set_index('date')
df1.index = pd.to_datetime(df1.index).strftime('%Y-%m-%d %H:%M:%S')
What I want to do is create a new column named "New_column" holding a categorical variable, 'Yes' or 'No', depending on whether the value in the "number" column increases by at least 20 percent later in the same day.
So in this fake data, only the second value, 12.85, gets "Yes", because it is followed by 15.85 at the timestamp "2019-02-19 19:45:00", an increase of 23.35 percent.
Even though the first value is 25% greater than the 3rd value, that comparison looks into the future, so it should not be counted.
After the process, I should have NaN in the "New_column" for the last row of each day.
I have been trying many different ways to do it using:
pandas.DataFrame.pct_change
pandas.DataFrame.diff
How can I do this in a Pythonic way?
Initial setup
import numpy as np
import pandas as pd

data = {
    'datetime': ['20190219 093100', '20190219 103200', '20190219 171200',
                 '20190219 193900', '20190219 194500', '20190220 093500',
                 '20190220 093600'],
    'number': [18.6125, 12.85, 14.89, 15.8301, 15.85, 14.916, 14.95]
}
df = pd.DataFrame(data)
df['datetime'] = pd.to_datetime(df['datetime'])  # astype('datetime64') is deprecated
df = df.sort_values('datetime')
df['date'] = df['datetime'].dt.date
df['New_column'] = 'No'
Find all rows that see a 20% increase later in the same day
indices_true = set()
for idx_low, row_low in df.iterrows():
    for idx_high, row_high in df.iterrows():
        if (row_low['date'] == row_high['date'] and
                row_low['datetime'] < row_high['datetime'] and
                row_low['number'] * 1.2 < row_high['number']):
            indices_true.add(idx_low)
# Assign 'Yes' for the matching rows
for i in indices_true:
    df.loc[i, 'New_column'] = 'Yes'
# Last timestamp every day assigned as NaN
df.loc[df['date'] != df['date'].shift(-1), 'New_column'] = np.nan
# Optionally convert to categorical variable
df['New_column'] = pd.Categorical(df['New_column'])
Output
>>> df
datetime number date New_column
0 2019-02-19 09:31:00 18.6125 2019-02-19 No
1 2019-02-19 10:32:00 12.8500 2019-02-19 Yes
2 2019-02-19 17:12:00 14.8900 2019-02-19 No
3 2019-02-19 19:39:00 15.8301 2019-02-19 No
4 2019-02-19 19:45:00 15.8500 2019-02-19 NaN
5 2019-02-20 09:35:00 14.9160 2019-02-20 No
6 2019-02-20 09:36:00 14.9500 2019-02-20 NaN
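The nested iterrows() loop above is O(n²). For larger frames, a vectorized sketch of the same idea (my own variant, not part of the original answer) computes each row's maximum over all later timestamps in the same day with a reversed cumulative max:

```python
import numpy as np
import pandas as pd

data = {
    'datetime': ['20190219 093100', '20190219 103200', '20190219 171200',
                 '20190219 193900', '20190219 194500', '20190220 093500',
                 '20190220 093600'],
    'number': [18.6125, 12.85, 14.89, 15.8301, 15.85, 14.916, 14.95]
}
df = pd.DataFrame(data)
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.sort_values('datetime')
df['date'] = df['datetime'].dt.date

# Walk each day backwards: the cummax of the reversed values, shifted by one
# so a row never compares against itself, is the max over all later rows.
rev = df.iloc[::-1]
later_max = (rev.groupby('date')['number'].cummax()
                .groupby(rev['date']).shift()
                .iloc[::-1])

df['New_column'] = np.where(later_max >= df['number'] * 1.2, 'Yes', 'No')
# later_max is NaN exactly on the last row of each day
df.loc[later_max.isna(), 'New_column'] = np.nan
```

This reproduces the answer's output in one pass per group; note it treats exactly +20% as a 'Yes' (>=), while the loop above uses a strict inequality.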
Related
I have two pandas dataframes with time indexes. One is a real time series and the other is an ideal time series (all the dates I want). I want to merge them by date, but keep all values, even the NaN values.
Creating the real time series:
import numpy as np
import pandas as pd

start_date = '2019-01-01'
end_date = '2020-01-01'
rows, cols = 51, 1
data = np.random.rand(rows)
tidx = pd.date_range(start_date, periods=rows, freq='16d')
data_frame = pd.DataFrame(data, columns=['EVI'], index=tidx)
print(data_frame)
Output:
EVI
2019-01-01 0.395097
2019-01-17 0.300081
2019-02-02 0.080104
......
......
Creating the ideal time series with NaN values (I just want the index)
ideal_tidx = pd.date_range(start_date, end_date, freq='8d')
dummy_data = np.empty((ideal_tidx.size))
dummy_data[:] = np.nan
dummy_data_frame = pd.DataFrame(dummy_data, columns=['EVI'], index=ideal_tidx)
Output:
EVI
2019-01-01 NaN
2019-01-09 NaN
2019-01-17 NaN
....
I want to merge to have something like this:
EVI
2019-01-01 0.395097
2019-01-09 NaN
2019-01-17 0.300081
2019-01-25 NaN
2019-02-02 0.080104
....
Just use an outer join, then drop the unnecessary column from your dummy dataframe:
df = data_frame.merge(dummy_data_frame, left_index=True, right_index=True, how='outer',
                      suffixes=('', '_dummy'))
df.drop(columns='EVI_dummy', inplace=True)
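Since only the index of the dummy frame matters, an equivalent approach (a sketch, not the answer's method) skips the dummy frame entirely and reindexes onto the union of both indexes:

```python
import numpy as np
import pandas as pd

start_date, end_date = '2019-01-01', '2020-01-01'
tidx = pd.date_range(start_date, periods=51, freq='16d')
data_frame = pd.DataFrame(np.random.rand(51), columns=['EVI'], index=tidx)

ideal_tidx = pd.date_range(start_date, end_date, freq='8d')

# Existing dates keep their EVI values; dates present only in the
# ideal index are filled with NaN.
merged = data_frame.reindex(data_frame.index.union(ideal_tidx))
```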
I have a dataframe (df) with a date index. And I want to achieve the following:
1. Take the Dates column and add one month -> e.g. nxt_dt = df.index + pd.DateOffset(months=1), and let's call df.index curr_dt.
2. Find the nearest entry in Dates that is >= nxt_dt.
3. Count the rows between curr_dt and nxt_dt and put the count into a column in df.
The result is supposed to look like this:
px_volume listed_sh ... iv_mid_6m '30d'
Dates ...
2005-01-03 228805 NaN ... 0.202625 21
2005-01-04 189983 NaN ... 0.203465 22
2005-01-05 224310 NaN ... 0.202455 23
2005-01-06 221988 NaN ... 0.202385 20
2005-01-07 322691 NaN ... 0.201065 21
Needless to say, the df only contains dates/rows for which there are observations.
I can think of some different ways to get this done in loops, but since the data I work with is quite big, I would really like to avoid to loop through rows to fill them.
Is there a way in pandas to get this done vectorized?
If you are OK with reindexing, this should do the job:
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['2020-01-01', '2020-01-08', '2020-01-24',
                            '2020-01-29', '2020-02-09', '2020-03-04']})
df['date'] = pd.to_datetime(df['date'])
df['value'] = 1
df = df.set_index('date')
# Insert every calendar day so a 30-row window spans exactly 30 days
df = df.reindex(pd.date_range('2020-01-01', '2020-03-04')).fillna(0)
# Reverse the index so the rolling window looks forward in time
df = df.sort_index(ascending=False)
df['30d'] = df['value'].rolling(30).sum() - 1  # subtract the row itself
df.sort_index().query("value == 1")
gives:
value 30d
2020-01-01 1.0 3.0
2020-01-08 1.0 2.0
2020-01-24 1.0 2.0
2020-01-29 1.0 1.0
2020-02-09 1.0 NaN
2020-03-04 1.0 NaN
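An alternative that avoids materializing every calendar day (my own sketch, also using a 30-day window rather than a true calendar month) binary-searches each date's cutoff with searchsorted and takes the difference in positions. Unlike the rolling version, the trailing rows get a plain count instead of NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['2020-01-01', '2020-01-08', '2020-01-24',
                            '2020-01-29', '2020-02-09', '2020-03-04']})
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').sort_index()

idx = df.index
# For each date d, the rows in the window (d, d + 30 days] are those up to
# the insertion point of d + 30 days, minus the rows at or before d itself.
upper = idx.searchsorted(idx + pd.Timedelta(days=30), side='right')
df['30d'] = upper - np.arange(len(idx)) - 1
```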
Edit: You can use the alleged duplicate's reindex() solution if your dates don't include times; otherwise you need a solution like the one by @kosnik. In addition, their solution doesn't need your dates to be the index!
I have data formatted like this
df = pd.DataFrame(data=[['2017-02-12 20:25:00', 'Sam', '8'],
['2017-02-15 16:33:00', 'Scott', '10'],
['2017-02-15 16:45:00', 'Steve', '5']],
columns=['Datetime', 'Sender', 'Count'])
df['Datetime'] = pd.to_datetime(df['Datetime'], format='%Y-%m-%d %H:%M:%S')
Datetime Sender Count
0 2017-02-12 20:25:00 Sam 8
1 2017-02-15 16:33:00 Scott 10
2 2017-02-15 16:45:00 Steve 5
I need there to be at least one row for every date, so the expected result would be
Datetime Sender Count
0 2017-02-12 20:25:00 Sam 8
1 2017-02-13 00:00:00 None 0
2 2017-02-14 00:00:00 None 0
3 2017-02-15 16:33:00 Scott 10
4 2017-02-15 16:45:00 Steve 5
I have tried making the datetime column the index, adding the missing dates, and using reindex() like so:
from datetime import timedelta

df.index = df['Datetime']
values = df['Datetime'].tolist()
for i in range(len(values)-1):
    if values[i].date() + timedelta(days=1) < values[i+1].date():
        values.insert(i+1, pd.Timestamp(values[i].date() + timedelta(days=1)))
print(df.reindex(values, fill_value=0))
But this makes the rows lose their other columns' values, and the same thing happens with asfreq('D') or resample():
ID Sender Count
Datetime
2017-02-12 16:25:00 0 Sam 8
2017-02-13 00:00:00 0 0 0
2017-02-14 00:00:00 0 0 0
2017-02-15 20:25:00 0 0 0
2017-02-15 20:25:00 0 0 0
What would be the appropriate way of going about this?
I would create a new DataFrame containing all the required datetimes, then left join it with your dataframe.
A working code example is the following:
import datetime

import numpy as np
import pandas as pd

df['Datetime'] = pd.to_datetime(df['Datetime'])  # first convert to datetimes
datetimes = df['Datetime'].tolist()  # existing datetimes - the missing ones are appended below
dates = [x.date() for x in datetimes]  # converted to dates
min_date = min(dates)
max_date = max(dates)
for d in range((max_date - min_date).days):
    forward_date = min_date + datetime.timedelta(d)
    if forward_date not in dates:
        datetimes.append(np.datetime64(forward_date))
# create a new dataframe, merge, and fill the 'Count' column with zeroes
df = pd.DataFrame({'Datetime': datetimes}).merge(df, on='Datetime', how='left')
df['Count'] = df['Count'].fillna(0)
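The same fill can be done without the explicit loop. A sketch (using integer counts rather than the question's string counts): build the full daily range, take the set difference against the days already present, and concatenate a zero-filled filler frame:

```python
import pandas as pd

df = pd.DataFrame(data=[['2017-02-12 20:25:00', 'Sam', 8],
                        ['2017-02-15 16:33:00', 'Scott', 10],
                        ['2017-02-15 16:45:00', 'Steve', 5]],
                  columns=['Datetime', 'Sender', 'Count'])
df['Datetime'] = pd.to_datetime(df['Datetime'])

# Calendar days in the observed range that have no row yet
all_days = pd.date_range(df['Datetime'].min().normalize(),
                         df['Datetime'].max().normalize(), freq='D')
missing = all_days.difference(df['Datetime'].dt.normalize())

filler = pd.DataFrame({'Datetime': missing, 'Sender': None, 'Count': 0})
out = (pd.concat([df, filler], ignore_index=True)
         .sort_values('Datetime')
         .reset_index(drop=True))
```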
I have a huge dataframe with many columns, many of which are of type datetime.datetime. The problem is that many also have mixed types, including for instance datetime.datetime values and None values (and potentially other invalid values):
0 2017-07-06 00:00:00
1 2018-02-27 21:30:05
2 2017-04-12 00:00:00
3 2017-05-21 22:05:00
4 2018-01-22 00:00:00
...
352867 2019-10-04 00:00:00
352868 None
352869 some_string
Name: colx, Length: 352872, dtype: object
Hence the result is an object-dtype column. This can be solved with df.colx.fillna(pd.NaT). The problem is that the dataframe has too many columns to inspect them individually.
Another approach is to use pd.to_datetime(col, errors='coerce'), however this will cast to datetime many columns that contain numerical values.
I could also do df.fillna(float('nan'), inplace=True), though the columns containing dates are still of object type, and would still have the same problem.
What approach could I follow to cast to datetime those columns whose values really do contain datetime values, but could also contain None, and potentially some invalid values (mentioning since otherwise a pd.to_datetime in a try/except clause would do)? Something like a flexible version of pd.to_datetime(col)
This function will set a column's data type to datetime if any value in the column matches the regex pattern (\d{4}-\d{2}-\d{2})+ (e.g. 2019-01-01). Credit to this answer on how to search for a string in all Pandas DataFrame columns and filter, which helped with setting and applying the mask.
def presume_date(dataframe):
    """Set a column's data type to datetime by presuming that any date
    values in the column indicate the column should be datetime.

    Args:
        dataframe: Pandas dataframe.

    Returns:
        Pandas dataframe.
    """
    df = dataframe.copy()
    mask = dataframe.astype(str).apply(lambda x: x.str.match(
        r'(\d{4}-\d{2}-\d{2})+').any())
    df_dates = df.loc[:, mask].apply(pd.to_datetime, errors='coerce')
    for col in df_dates.columns:
        df[col] = df_dates[col]
    return df
Working from the suggestion to use dateutil, this may help. It still works on the presumption that if there are any date-like values in a column, the column should be a datetime. I tried to consider faster dataframe iteration methods; I think this answer on How to iterate over rows in a DataFrame in Pandas describes them well.
Note that dateutil.parser will use the current day or year for any strings like 'December' or 'November 2019' with no year or day values.
import pandas as pd
import datetime
from dateutil.parser import parse
# df.append() was removed in pandas 2.0, so build the frame in one go
df = pd.DataFrame({
    'are_you_a_date': ['December 2015', 'February 27 2018', 'May 2017 12',
                       '2017-05-21', None, 'some_string',
                       'Processed: 2019/01/25', 'December'],
    'no_dates_here': ['just a string'] * 8,
})
def parse_dates(x):
    try:
        return parse(x, fuzzy=True)
    except (ValueError, TypeError):
        return ''

list_of_datetime_columns = []
for col in df.columns:
    if any(isinstance(parse_dates(value), datetime.datetime)
           for value in df[col].values):
        list_of_datetime_columns.append(col)

df_dates = df.loc[:, list_of_datetime_columns].apply(pd.to_datetime, errors='coerce')
for col in list_of_datetime_columns:
    df[col] = df_dates[col]
In case you would also like to keep the datetime values produced by dateutil.parser, you can add this:
for col in list_of_datetime_columns:
    df[col] = df[col].apply(parse_dates)
The main problem I see is when parsing numerical values, so I'd propose converting them to strings first.
Setup
dat = {
'index': [0, 1, 2, 3, 4, 352867, 352868, 352869],
'columns': ['Mixed', 'Numeric Values', 'Strings'],
'data': [
['2017-07-06 00:00:00', 1, 'HI'],
['2018-02-27 21:30:05', 1, 'HI'],
['2017-04-12 00:00:00', 1, 'HI'],
['2017-05-21 22:05:00', 1, 'HI'],
['2018-01-22 00:00:00', 1, 'HI'],
['2019-10-04 00:00:00', 1, 'HI'],
['None', 1, 'HI'],
['some_string', 1, 'HI']
]
}
df = pd.DataFrame(**dat)
df
Mixed Numeric Values Strings
0 2017-07-06 00:00:00 1 HI
1 2018-02-27 21:30:05 1 HI
2 2017-04-12 00:00:00 1 HI
3 2017-05-21 22:05:00 1 HI
4 2018-01-22 00:00:00 1 HI
352867 2019-10-04 00:00:00 1 HI
352868 None 1 HI
352869 some_string 1 HI
Solution
df.astype(str).apply(pd.to_datetime, errors='coerce')
Mixed Numeric Values Strings
0 2017-07-06 00:00:00 NaT NaT
1 2018-02-27 21:30:05 NaT NaT
2 2017-04-12 00:00:00 NaT NaT
3 2017-05-21 22:05:00 NaT NaT
4 2018-01-22 00:00:00 NaT NaT
352867 2019-10-04 00:00:00 NaT NaT
352868 NaT NaT NaT
352869 NaT NaT NaT
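The blanket conversion above turns every non-date column into all-NaT, so you still need to decide which conversions to keep. One heuristic (my own sketch; the 0.5 threshold is an arbitrary assumption) keeps a column's conversion only when a majority of its values parsed successfully:

```python
import pandas as pd

df = pd.DataFrame({
    'Mixed': ['2017-07-06 00:00:00', '2018-02-27 21:30:05',
              '2019-10-04 00:00:00', 'None', 'some_string'],
    'Numeric Values': [1, 2, 3, 4, 5],
    'Strings': ['HI'] * 5,
})

converted = df.astype(str).apply(pd.to_datetime, errors='coerce')
# Fraction of values per column that parsed as dates
parse_rate = converted.notna().mean()
for col in df.columns:
    if parse_rate[col] > 0.5:  # majority threshold - tune to taste
        df[col] = converted[col]
```

Here only 'Mixed' (3 of 5 values parse) is converted; the numeric and string columns keep their original dtypes.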
I am looking for a way to create the column 'min_value' in the dataframe df below. For each row i, we subset the entire dataframe to the records matching row i's grouping ['Date_A', 'Date_B'] and having 'Advance' less than the 'Advance' of row i, and we set 'min_value' for row i to the minimum of the 'Amount' column over that subset:
Initial dataframe:
dates_A = ['2017-12-25','2017-12-25','2017-12-25','2018-1-25','2018-1-25','2018-1-25']
Date_A = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_A]
dates_B = ['2018-1-1','2018-1-1','2018-1-1','2018-2-1','2018-2-1','2018-2-1']
Date_B = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_B]
df = pd.DataFrame({'Date_A':Date_A,
'Date_B':Date_B,
'Advance' : [10,103,200,5,8,150],
'Amount' : [180,220,200,230,220,240]})
df = df [['Date_A', 'Date_B', 'Advance', 'Amount']]
df
Desired output:
dates_A = ['2017-12-25','2017-12-25','2017-12-25','2018-1-25','2018-1-25','2018-1-25']
Date_A = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_A]
dates_B = ['2018-1-1','2018-1-1','2018-1-1','2018-2-1','2018-2-1','2018-2-1']
Date_B = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_B]
df_out = pd.DataFrame({'Date_A':Date_A,
'Date_B':Date_B,
'Advance' : [10,103,200,5,8,150],
'Amount' : [180,220,200,230,220,240],
'min_value': [180,180,180,230,230,220] })
df_out = df_out [['Date_A', 'Date_B', 'Advance', 'Amount','min_value']]
df_out
I wrote the following loop that I think does the job, but it takes far too long to run; I guess there must be much more efficient ways to accomplish this.
for i in range(len(df)):
    date1 = df['Date_A'][i]     # Date_A of row i
    date2 = df['Date_B'][i]     # Date_B of row i
    advance = df['Advance'][i]  # Advance of row i
    # subset the dataframe to the rows meeting the date and advance conditions
    df.loc[i, 'min_value'] = df[(df['Date_A'] == date1) &
                                (df['Date_B'] == date2) &
                                (df['Advance'] < advance)]['Amount'].min()
# for the smallest advance value, set min_value to the row's own amount
df.loc[df['min_value'].isnull(), 'min_value'] = df['Amount']
df
I hope it is clear enough, thanks for your help.
Improvement question
Thanks a lot for the answer. For the last part, the NaN rows, I'd like to replace the row's amount with the overall minimum amount of the (Date_A, Date_B, Advance) grouping, so that I have the overall minimum of the last day before Date_A.
Improvement desired output (two records for the smallest advance value):
dates_A = ['2017-12-25','2017-12-25','2017-12-25','2017-12-25']
Date_A = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_A]
dates_B = ['2018-1-1','2018-1-1','2018-1-1','2018-1-1']
Date_B = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_B]
df_out = pd.DataFrame({'Date_A':Date_A,
'Date_B':Date_B,
'Advance' : [5,8,150,5],
'Amount' : [230,220,240,225],
'min_value': [225,230,220,225] })
df_out = df_out [['Date_A', 'Date_B', 'Advance', 'Amount','min_value']]
df_out
Thanks
You can use groupby on 'Date_A' and 'Date_B' after sorting the values by 'Advance', then apply cummin followed by shift to the column 'Amount'. Finally, use fillna with the values from the column 'Amount', such as:
df['min_value'] = (df.sort_values('Advance').groupby(['Date_A', 'Date_B'])['Amount']
                     .apply(lambda ser_g: ser_g.cummin().shift()).fillna(df['Amount']))
and you get:
Date_A Date_B Advance Amount min_value
0 2017-12-25 2018-01-01 10 180 180.0
1 2017-12-25 2018-01-01 103 220 180.0
2 2017-12-25 2018-01-01 200 200 180.0
3 2018-01-25 2018-02-01 5 230 230.0
4 2018-01-25 2018-02-01 8 220 230.0
5 2018-01-25 2018-02-01 150 240 220.0
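If the apply(lambda ...) step becomes a bottleneck on a large frame, the same result can be sketched with only built-in groupby operations (my own variant, under the same data):

```python
import pandas as pd

df = pd.DataFrame({
    'Date_A': pd.to_datetime(['2017-12-25'] * 3 + ['2018-1-25'] * 3),
    'Date_B': pd.to_datetime(['2018-1-1'] * 3 + ['2018-2-1'] * 3),
    'Advance': [10, 103, 200, 5, 8, 150],
    'Amount': [180, 220, 200, 230, 220, 240],
})

srt = df.sort_values('Advance')
# Running minimum of Amount per group, in order of increasing Advance...
running_min = srt.groupby(['Date_A', 'Date_B'])['Amount'].cummin()
# ...shifted one row within each group, so each row only sees strictly
# smaller Advance values; the smallest Advance per group gets NaN.
prev_min = running_min.groupby([srt['Date_A'], srt['Date_B']]).shift()
df['min_value'] = prev_min.fillna(df['Amount'])  # aligns on the index
```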