pandas selecting rows for specific time period - python

I have a pandas dataframe with a date index, like this:
A B C
date
2021-04-22 2 1 3
2021-05-22 3 2 4
2021-06-22 4 3 5
2021-07-22 5 4 6
2021-08-22 6 5 7
I want to create a new dataframe that contains only the rows from the 2 days before a given date. For example, if I give selected = '2021-08-22', what I need is a new dataframe like the one below:
A B C
date
2021-07-22 5 4 6
2021-08-22 6 5 7
Can someone please help me with this? Many thanks for your help.

You can convert the index to a DatetimeIndex, then use df[start_date : end_date]:
df.index = pd.to_datetime(df.index)
selected = '2021-08-22'
res = df[(pd.to_datetime(selected)-pd.Timedelta(days=2)) : selected]
print(res)
A B C
2021-08-22 6 5 7
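Since the expected output in the question contains the previous monthly row rather than the previous two calendar days, a month-based window may be closer to what is wanted. Here is a minimal sketch, assuming the sample data from the question and a sorted DatetimeIndex. The names start and last_two are illustrative only.

```python
import pandas as pd

# Sample data from the question, indexed by a sorted DatetimeIndex
df = pd.DataFrame(
    {"A": [2, 3, 4, 5, 6], "B": [1, 2, 3, 4, 5], "C": [3, 4, 5, 6, 7]},
    index=pd.to_datetime(
        ["2021-04-22", "2021-05-22", "2021-06-22", "2021-07-22", "2021-08-22"]
    ),
)
df.index.name = "date"

selected = pd.Timestamp("2021-08-22")

# Option 1: slice a calendar window reaching back one month (both ends inclusive)
start = selected - pd.DateOffset(months=1)
res = df.loc[start:selected]

# Option 2: take the last two rows at or before the selected date,
# whatever the spacing between them
last_two = df.loc[:selected].tail(2)

print(res)
```

Option 1 depends on the calendar distance between rows. Option 2 depends only on row order, so it may suit irregularly spaced data better.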

I'm assuming that you meant months instead of days.
You can use the df.apply method to filter the dataframe rows with a function.
Here is a function that receives the inputs you described and returns the new dataframe:
Working example
from datetime import datetime
def filter_df(df, date, num_months):
    # Assumes "date" is a regular string column; call df.reset_index() first if it is the index
    def diff_month(row):
        date1 = datetime.strptime(row["date"], '%Y-%m-%d')
        date2 = datetime.strptime(date, '%Y-%m-%d')
        return (date1.year - date2.year) * 12 + date1.month - date2.month
    # Keep rows whose month is fewer than num_months before the given date
    return df[df.apply(diff_month, axis=1) > -num_months]
print(filter_df(df, "2021-08-22", 2))
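The same month-difference test can be written without a row-wise apply. This is a sketch under the assumption that date is a string column as in the function above. month_diff is a name made up for this example, and the upper bound (no rows after the selected date) is an addition that the function above does not enforce.

```python
import pandas as pd

# Sample data with "date" as a regular string column
df = pd.DataFrame({
    "date": ["2021-04-22", "2021-05-22", "2021-06-22", "2021-07-22", "2021-08-22"],
    "A": [2, 3, 4, 5, 6],
    "B": [1, 2, 3, 4, 5],
    "C": [3, 4, 5, 6, 7],
})

selected = pd.Timestamp("2021-08-22")
num_months = 2

# Whole-column arithmetic: difference in calendar months to the selected date
dates = pd.to_datetime(df["date"])
month_diff = (dates.dt.year - selected.year) * 12 + (dates.dt.month - selected.month)

# Keep the selected month and the num_months - 1 months before it
res = df[(month_diff > -num_months) & (month_diff <= 0)]
print(res)
```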

Related

Just keep the first value of every minute in pandas dataframe

I want to reduce my data. My initial dataframe looks as follows:
index  time [hh:mm:ss]         value1  value2
0      0 days 00:00:00.000000  3       4
1      0 days 00:00:04.000000  5       2
2      0 days 00:02:02.002300  7       9
3      0 days 00:02:03.000000  9       7
4      0 days 03:02:03.000000  4       3
Now I want to reduce my data so that only the first row of every new minute (and likewise of every new hour and day) is kept. All remaining rows within that minute should be dropped.
So the resulting table looks as follows:
index  time                    value1  value2
0      0 days 00:00:00.000000  3       4
2      0 days 00:02:02.002300  7       9
4      0 days 03:02:03.000000  4       3
Any ideas how to approach this?
Because the column holds timedeltas, you can create a TimedeltaIndex and use DataFrame.resample by 1 minute with Resampler.first. Resampling adds a row for every minute in the range, so the rows that contain only NaNs are removed afterwards:
df.index = pd.to_timedelta(df['time [hh:mm:ss]'])
df = df.resample('1Min').first().dropna(how='all').reset_index(drop=True)
print (df)
time [hh:mm:ss] value1 value2
0 0 days 00:00:00.000000 3.0 4.0
1 0 days 00:02:02.002300 7.0 9.0
2 0 days 03:02:03.000000 4.0 3.0
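An alternative that avoids creating the empty one-minute bins is to floor each timedelta to its minute and keep only the first row of each floored value. This is a sketch assuming the column layout from the question. minute_start and first_per_minute are illustrative names.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "time [hh:mm:ss]": [
        "0 days 00:00:00.000000",
        "0 days 00:00:04.000000",
        "0 days 00:02:02.002300",
        "0 days 00:02:03.000000",
        "0 days 03:02:03.000000",
    ],
    "value1": [3, 5, 7, 9, 4],
    "value2": [4, 2, 9, 7, 3],
})

# Round every timedelta down to the start of its minute
td = pd.to_timedelta(df["time [hh:mm:ss]"])
minute_start = td.dt.floor("min")

# duplicated() marks every repeat after the first occurrence, so negating it
# keeps exactly the first row of each minute
first_per_minute = df[~minute_start.duplicated()]
print(first_per_minute)
```

Because no resampling takes place, the value columns keep their integer dtype and the original index labels (0, 2, 4) are preserved.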
You could extract the D:HH:MM using apply and multiple splits, and then delete the duplicates, choosing the first value.
dms = df['time [hh:mm:ss]'].apply(lambda x: ':'.join( [x.split(' days ')[0], *x.split('days ')[1].split(':')[:2]]) )
df.iloc[dms.drop_duplicates().index]
from io import StringIO
import pandas as pd
d = '''index,time,value1,value2
0,0 days 00:00:00.000000,3,4
1,0 days 00:00:04.000000,5,2
2,0 days 00:02:02.002300,7,9
3,0 days 00:02:03.000000,9,7
4,0 days 03:02:03.000000,4,3'''
df = pd.read_csv(StringIO(d))
# Drop the leading "0 days " prefix and parse the rest as a time of day
df['time1'] = pd.to_datetime(df['time'].str.slice(7))
df.set_index('time1', inplace=True)
# Keep the first row of every (hour, minute) group; note that this ignores the day part
df.groupby([df.index.hour, df.index.minute]).head(1).sort_index().reset_index(drop=True)

How to group and aggregate data starting from constant and ending on changing date? [duplicate]

I need to aggregate data between constant date, like first day of year, and all the other dates through the year. There are two variants of this problem:
easier - sum:
created_at value
01-01-2012 5
02-01-2012 6
05-01-2012 1
05-01-2012 1
01-02-2012 3
02-02-2012 2
05-02-2012 1
which should output:
Date Month to date sum Year to date sum
01-01-2012 5 5
02-01-2012 11 11
05-01-2012 13 13
01-02-2012 3 16
02-02-2012 5 18
05-02-2012 6 19
and harder - count unique:
created_at value
01-01-2012 a
02-01-2012 b
05-01-2012 c
05-01-2012 c
01-02-2012 a
02-02-2012 a
05-02-2012 d
which should output:
Date Month to date unique Year to date unique
01-01-2012 1 1
02-01-2012 2 2
05-01-2012 3 3
01-02-2012 1 3
02-02-2012 1 3
05-02-2012 2 4
The data is, of course, in a pandas DataFrame. The obvious but very clumsy way is to write a for loop between the starting date and the moving one. The problem looks like a common one. Is there some reasonable built-in pandas way to do this type of computation? For the unique count I also want to avoid stacking lists, as I have a large number of rows and unique values.
I was checking out Pandas window functions, but it doesn't look like a solution.
Try with groupby:
Cumulative sum:
df["created_at"] = pd.to_datetime(df["created_at"], format="%d-%m-%Y")
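# Group by calendar month (year and month) so the same month in different years is not merged
df["Month to date sum"] = df.groupby(df["created_at"].dt.to_period("M"))["value"].transform('cumsum')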
df["Year to date sum"] = df.groupby(df["created_at"].dt.year)["value"].transform('cumsum')
>>> df
created_at value Month to date sum Year to date sum
0 2012-01-01 5 5 5
1 2012-01-02 6 11 11
2 2012-01-05 1 12 12
3 2012-02-01 3 3 15
4 2012-02-02 2 5 17
5 2012-02-05 1 6 18
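The question's sample has two rows for 05-01-2012, while its desired output has one row per date. One way to get that shape is to sum per day first and then take running totals. This is a sketch assuming the seven-row sample from the question. daily and out are names chosen for this example.

```python
import pandas as pd

# Seven-row sample from the question (dates are day-month-year)
df = pd.DataFrame({
    "created_at": ["01-01-2012", "02-01-2012", "05-01-2012", "05-01-2012",
                   "01-02-2012", "02-02-2012", "05-02-2012"],
    "value": [5, 6, 1, 1, 3, 2, 1],
})
df["created_at"] = pd.to_datetime(df["created_at"], format="%d-%m-%Y")

# Collapse duplicate dates into one daily total (the result is sorted by date)
daily = df.groupby("created_at")["value"].sum()

# Running totals that restart at each calendar month / each year
out = pd.DataFrame({
    "Month to date sum": daily.groupby(daily.index.to_period("M")).cumsum(),
    "Year to date sum": daily.groupby(daily.index.year).cumsum(),
})
print(out)
# Month to date: 5, 11, 13, 3, 5, 6
# Year to date:  5, 11, 13, 16, 18, 19
```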
Cumulative unique count:
df2["created_at"] = pd.to_datetime(df2["created_at"], format="%d-%m-%Y")
# transform keeps the original row index, so the result can be assigned back directly
df2["Month to date unique"] = df2.groupby(df2["created_at"].dt.to_period("M"))["value"].transform(lambda x: (~x.duplicated()).cumsum())
df2["Year to date unique"] = df2.groupby(df2["created_at"].dt.year)["value"].transform(lambda x: (~x.duplicated()).cumsum())
>>> df2
created_at value Month to date unique Year to date unique
0 2012-01-01 a 1 1
1 2012-01-02 b 2 2
2 2012-01-05 c 3 3
3 2012-02-01 a 1 3
4 2012-02-02 a 1 3
5 2012-02-05 d 2 4
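For the seven-row sample in the question (with the repeated 05-01-2012 row), the running unique count can be followed by a per-date reduction to get one row per date. This is a sketch assuming the rows are already sorted by date. running_unique and per_day are hypothetical names.

```python
import pandas as pd

# Seven-row sample from the question (dates are day-month-year)
df2 = pd.DataFrame({
    "created_at": ["01-01-2012", "02-01-2012", "05-01-2012", "05-01-2012",
                   "01-02-2012", "02-02-2012", "05-02-2012"],
    "value": ["a", "b", "c", "c", "a", "a", "d"],
})
df2["created_at"] = pd.to_datetime(df2["created_at"], format="%d-%m-%Y")


def running_unique(s):
    """Count the distinct values seen so far within one group."""
    # A value adds to the count only the first time it appears
    return (~s.duplicated()).cumsum()


month_key = df2["created_at"].dt.to_period("M")
year_key = df2["created_at"].dt.year
df2["Month to date unique"] = df2.groupby(month_key)["value"].transform(running_unique)
df2["Year to date unique"] = df2.groupby(year_key)["value"].transform(running_unique)

# One row per date: the last running count reached on that date
per_day = df2.groupby("created_at")[["Month to date unique", "Year to date unique"]].last()
print(per_day)
```

No lists of values are stacked here. Only a boolean "first occurrence" flag per row is accumulated.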

"Dynamic" column selection

The problem:
The input table, let's say, is a merged table of calls and bills, having columns: TIME of the call and months of all the bills. The idea is to have a table that has the last 3 bills the person paid starting from the time of the call. That way putting the bills in context of the call.
The Example input and output:
# INPUT:
# df
# TIME ID 2019-08-01 2019-09-01 2019-10-01 2019-11-01 2019-12-01
# 2019-12-01 1 1 2 3 4 5
# 2019-11-01 2 6 7 8 9 10
# 2019-10-01 3 11 12 13 14 15
# EXPECTED OUTPUT:
# df_context
# TIME ID 0 1 2
# 2019-12-01 1 3 4 5
# 2019-11-01 2 7 8 9
# 2019-10-01 3 11 12 13
EXAMPLE INPUT CREATION:
df = pd.DataFrame({
'TIME': ['2019-12-01','2019-11-01','2019-10-01'],
'ID': [1,2,3],
'2019-08-01': [1,6,11],
'2019-09-01': [2,7,12],
'2019-10-01': [3,8,13],
'2019-11-01': [4,9,14],
'2019-12-01': [5,10,15],
})
The code I have got so far:
# HOW DOES ONE GET THE col_to FOR EVERY ROW?
col_to = df.columns.get_loc(df['TIME'].astype(str).values[0])
col_from = col_to - 3
df_context = pd.DataFrame()
df_context = df_context.append(pd.DataFrame(df.iloc[:, col_from : col_to].values))
df_context["TIME"] = df["TIME"]
cols = df_context.columns.tolist()
df_context = df_context[cols[-1:] + cols[:-1]]
df_context.head()
OUTPUT of my code:
# OUTPUTS:
# TIME 0 1 2
# 0 2019-12-01 2 3 4 should be 3 4 5
# 1 2019-11-01 7 8 9 all good
# 2 2019-10-01 12 13 14 should be 11 12 13
What my code seems to lack is a for loop or two around the first two lines of code to do what I want it to do, but I just can't believe that there isn't a better solution than the one I am concocting right now.
I would suggest the following steps so that you can avoid dynamic column selection altogether:
1. Convert the wide table (reference dates as columns) to a long table (reference dates as rows).
2. Compute the difference in months between the time of the call (TIME) and the reference date.
3. Select only the rows with difference >= 0 and difference < 3.
4. Format the output table (add a running number, pivot it) according to your requirements.
# Initialize dataframe
df = pd.DataFrame({
'TIME': ['2019-12-01','2019-11-01','2019-10-01'],
'ID': [1,2,3],
'2019-08-01': [1,6,11],
'2019-09-01': [2,7,12],
'2019-10-01': [3,8,13],
'2019-11-01': [4,9,14],
'2019-12-01': [5,10,15],
})
# Convert the wide table to a long table by melting the date columns
# Name the new date column as REF_TIME, and the bill column as BILL
date_cols = ['2019-08-01', '2019-09-01', '2019-10-01', '2019-11-01', '2019-12-01']
df = df.melt(id_vars=['TIME','ID'], value_vars=date_cols, var_name='REF_TIME', value_name='BILL')
# Convert TIME and REF_TIME to datetime type
df['TIME'] = pd.to_datetime(df['TIME'])
df['REF_TIME'] = pd.to_datetime(df['REF_TIME'])
# Find out difference between TIME and REF_TIME
df['TIME_DIFF'] = (df['TIME'] - df['REF_TIME']).dt.days
df['TIME_DIFF'] = (df['TIME_DIFF'] / 30).round()
# Keep only the preceding 3 months (including the month = TIME)
selection = (
    (df['TIME_DIFF'] < 3) &
    (df['TIME_DIFF'] >= 0)
)
# Apply selection, sort the columns and keep only columns needed
df_out = (
    df[selection]
    .sort_values(['TIME','ID','REF_TIME'])
    [['TIME','ID','BILL']]
)
# Add a running number, lets call this BILL_NO
df_out = df_out.assign(BILL_NO = df_out.groupby(['TIME','ID']).cumcount() + 1)
# Pivot the output table to the format needed
df_out = df_out.pivot(index=['ID','TIME'], columns='BILL_NO', values='BILL')
Output:
BILL_NO 1 2 3
ID TIME
1 2019-12-01 3 4 5
2 2019-11-01 7 8 9
3 2019-10-01 11 12 13
Here is my (newbie's) solution. It only works if the dates in the column names are in ascending order:
# Initializing Dataframe
df = pd.DataFrame({
'TIME': ['2019-12-01','2019-11-01','2019-10-01'],
'ID': [1,2,3],
'2019-08-01': [1,6,11],
'2019-09-01': [2,7,12],
'2019-10-01': [3,8,13],
'2019-11-01': [4,9,14],
'2019-12-01': [5,10,15],})
cols = list(df.columns)
new_df = pd.DataFrame([], columns=["0","1","2"])
# Iterating over rows, selecting desired slices and appending them to a new DataFrame:
for i in range(len(df)):
    searched_date = df.iloc[i, 0]
    searched_column_index = cols.index(searched_date)
    searched_row = df.iloc[[i], searched_column_index-2:searched_column_index+1]
    mapping_column_names = {searched_row.columns[0]: "0", searched_row.columns[1]: "1", searched_row.columns[2]: "2"}
    searched_df = searched_row.rename(mapping_column_names, axis=1)
    new_df = pd.concat([new_df, searched_df], ignore_index=True)
new_df = pd.merge(df.iloc[:,0:2], new_df, left_index=True, right_index=True)
new_df
Output:
TIME ID 0 1 2
0 2019-12-01 1 3 4 5
1 2019-11-01 2 7 8 9
2 2019-10-01 3 11 12 13
Anyway, I think @Toukenize's solution is better since it doesn't require iterating.
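A third option is to look up each row's column position once and pick the values with NumPy fancy indexing. This is a sketch under some assumptions: the date columns are in ascending order, every TIME value matches a column name exactly, and each row has at least two earlier bill columns. The names date_cols, end, col_idx and picked are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'TIME': ['2019-12-01', '2019-11-01', '2019-10-01'],
    'ID': [1, 2, 3],
    '2019-08-01': [1, 6, 11],
    '2019-09-01': [2, 7, 12],
    '2019-10-01': [3, 8, 13],
    '2019-11-01': [4, 9, 14],
    '2019-12-01': [5, 10, 15],
})

# Bill columns only (everything after TIME and ID)
date_cols = df.columns[2:]

# Position of each row's TIME among the bill columns (-1 would mean "not found")
end = date_cols.get_indexer(df['TIME'])

# For every row, the positions of the two preceding bills and the bill at TIME
col_idx = end[:, None] + np.arange(-2, 1)

# Pick one value per (row, position) pair
values = df[date_cols].to_numpy()
picked = values[np.arange(len(df))[:, None], col_idx]

df_context = pd.concat(
    [df[['TIME', 'ID']], pd.DataFrame(picked, index=df.index)], axis=1
)
print(df_context)
```

If a TIME value is missing from the columns, get_indexer returns -1 and the row would silently pick the wrong cells, so checking `(end >= 2).all()` first may be worthwhile.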

Set particular value for a month of data based on column in dataframe

I have a dataframe made up of daily data across a number of columns;
A B C D
01/01/2020 12 3 2 1
02/01/2020 8 14 5 1
03/01/2020 45 4 1 3
.
.
.
.
31/12/2021 5 1 5 3
The data is generated automatically, but I would like to be able to overwrite data by month or by date.
I understand something like this could reset a single value, but is there any way to do it in bulk by month or between two certain dates?
df.set_value('C', 'x', 10)
Any help much appreciated!
Create a DatetimeIndex first and then set values with DataFrame.loc. Partial string indexing also works here, so you can set the values for a whole month:
df.index = pd.to_datetime(df.index, dayfirst=True)
df.loc['2020-01-02','C'] = 100
df.loc['2020-01','B'] = 500
df.loc['2020-01-01':'2020-01-02','A'] = 0
#select multiple columns by list
df.loc['2020-01-03':'2021-12-31', ['C','D']] = 1000
print (df)
A B C D
2020-01-01 0 500 2 1
2020-01-02 0 500 100 1
2020-01-03 45 500 1000 1000
2021-12-31 5 1 1000 1000
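The assignments above can be tried on a complete daily index. This is a sketch with made-up data (all zeros) over the same date range as the question. The last assignment uses a boolean mask on the index, which is an addition not shown in the answer above and may help when the same month should be overwritten in every year.

```python
import pandas as pd

# Hypothetical daily data covering 2020 and 2021, all zeros to start with
idx = pd.date_range("2020-01-01", "2021-12-31", freq="D")
df = pd.DataFrame({"A": 0, "B": 0, "C": 0, "D": 0}, index=idx)

# Overwrite one whole month via partial string indexing
df.loc["2020-02", "C"] = 10

# Overwrite an inclusive date range in several columns at once
df.loc["2020-03-10":"2020-03-20", ["A", "B"]] = -1

# Overwrite every December, regardless of year, with a boolean mask on the index
df.loc[df.index.month == 12, "D"] = 99

print(df.loc["2020-02-27":"2020-03-02"])
```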

Is it possible to use vectorization for a conditional count of rows in a Pandas Dataframe?

I have a Pandas Dataframe with data about calls. Each call has a unique ID and each customer has an ID (but can have multiple Calls). A third column gives a day. For each customer I want to calculate the maximum number of calls made in a period of 7 days.
I have been using the following code to count the number of calls within 7 days of the call on each row:
df['ContactsIN7Days'] = df.apply(lambda row: len(df[(df['PersonID']==row['PersonID']) & (abs(df['Day'] - row['Day']) <=7)]), axis=1)
Output:
CallID Day PersonID ContactsIN7Days
6 2 3 2
3 14 2 2
1 8 1 1
5 1 3 2
2 12 2 2
7 100 3 1
This works, but it is going to be applied to a big data set. Is there a way to make this more efficient, for example through vectorization?
IIUC, this is a convoluted but, I think, effective solution to your issue. Note that the row order of your dataframe changes as a result, and that your Day column is converted to a timedelta dtype:
Starting from your dataframe df:
CallID Day PersonID
0 6 2 3
1 3 14 2
2 1 8 1
3 5 1 3
4 2 12 2
5 7 100 3
Start by modifying Day to a timedelta series:
df['Day'] = pd.to_timedelta(df['Day'], unit='d')
Then use pd.merge_asof to merge your dataframe with the count of calls made by each individual in a period of 7 days. To get this count, use groupby with a pd.Grouper with a frequency of 7 days:
new_df = (pd.merge_asof(df.sort_values(['Day']),
df.sort_values(['Day'])
.groupby([pd.Grouper(key='Day', freq='7d'), 'PersonID'])
.size()
.to_frame('ContactsIN7Days')
.reset_index(),
left_on='Day', right_on='Day',
left_by='PersonID', right_by='PersonID',
direction='nearest'))
Your resulting new_df will look like this:
CallID Day PersonID ContactsIN7Days
0 5 1 days 3 2
1 6 2 days 3 2
2 1 8 days 1 1
3 2 12 days 2 2
4 3 14 days 2 2
5 7 100 days 3 1
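To reproduce the question's own definition (calls by the same person within 7 days either side of each call) without a row-wise apply, one possibility is a self-merge on PersonID followed by a filter. This is a sketch assuming Day is an integer column and CallID is unique. pairs, near and counts are illustrative names. The merge builds every pair of calls per person, so memory grows quadratically for customers with many calls.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "CallID": [6, 3, 1, 5, 2, 7],
    "Day": [2, 14, 8, 1, 12, 100],
    "PersonID": [3, 2, 1, 3, 2, 3],
})

# Pair every call with every call by the same person (including itself)
pairs = df.merge(df[["PersonID", "Day"]], on="PersonID", suffixes=("", "_other"))

# Keep only the pairs that are at most 7 days apart
near = pairs[(pairs["Day"] - pairs["Day_other"]).abs() <= 7]

# Number of nearby calls for each call, mapped back onto the original rows
counts = near.groupby("CallID").size()
df["ContactsIN7Days"] = df["CallID"].map(counts)

# Maximum per customer, as asked in the question
per_person = df.groupby("PersonID")["ContactsIN7Days"].max()
print(df)
print(per_person)
```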
