I want to extract the closing balance for the week across different dates from the dataframe below:
Date Week Balance
2017-02-12 6 50000.46
2017-02-12 6 49531.46
2017-02-12 6 48108.46
2017-05-12 19 21558.96
2017-08-12 32 21561.1
2018-02-05 6 2816.20
2018-02-06 6 78.53
2018-02-07 6 39.53
2018-08-12 32 21561.1
Expected output is:
Date Week Balance
2017-02-12 6 48108.46
2017-05-12 19 21558.96
2018-02-07 6 39.53
2018-08-12 32 21561.1
I tried to use the .last() method of groupby, but I get multiple returns for the same week:
weekly = df.groupby(["Transaction Date",'Week']).last().Balance
weekly
Date Week Balance
2017-02-12 6 48108.46
2017-03-12 10 46802.46
2017-04-12 15 39588.46
2017-05-12 19 21558.96
2018-02-03 5 24699.73
2018-02-04 5 103.20
2018-02-05 6 2816.20
2018-02-06 6 78.53
2018-02-07 6 39.53
You can use shift to check for consecutive rows and keep the last one:
df.loc[df['Week'] != df['Week'].shift(-1)]
Output:
| | Date | Week | Balance |
|---:|:-----------|-------:|----------:|
| 2 | 2017-02-12 | 6 | 48108.46 |
| 3 | 2017-05-12 | 19 | 21558.96 |
| 4 | 2017-08-12 | 32 | 21561.10 |
| 7 | 2018-02-07 | 6 | 39.53 |
| 8 | 2018-08-12 | 32 | 21561.10 |
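If the same week number can reappear later (as week 6 and week 32 do here), an equivalent groupby-based approach is to label each consecutive run of weeks with a cumulative sum and take the last row of each run; a minimal sketch assuming the rows are already in date order:
# start a new run id each time Week changes from the previous row
runs = (df['Week'] != df['Week'].shift()).cumsum()
weekly = df.groupby(runs).last().reset_index(drop=True)
print(weekly)
This produces the same rows as the shift solution, and the run labels make it easy to aggregate other columns per run as well.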
Related
I need to find the start and end date of each of the previous 12 months from the current date.
If the current date is 05-May-2022, then it should display the first and last date of each of the past 12 months, including the current month.
How can I achieve this, given that each month has a different number of days? Is there a function in datetime for this?
My code below only displays the previous month's first and last date; I want to print the previous 12 months.
from datetime import date, timedelta
this_first = date.today().replace(day=1)
prev_last = this_first - timedelta(days=1)
prev_first = prev_last.replace(day=1)
prev_first, prev_last
Output:
(datetime.date(2021, 1, 1), datetime.date(2021, 1, 31))
Expected Output:
[('2021-06-01', '2021-06-30'), ('2021-07-01', '2021-07-31'),
('2021-08-01', '2021-08-31'), ('2021-09-01', '2021-09-30'),
('2021-10-01', '2021-10-31'), ('2021-11-01', '2021-11-30'),
('2021-12-01', '2021-12-31'), ('2022-01-01', '2022-01-31'),
('2022-02-01', '2022-02-28'), ('2022-03-01', '2022-03-31'),
('2022-04-01', '2022-04-30'), ('2022-05-01', '2022-05-31')]
Note:
the values should be datetime objects.
You can use current_date.replace(day=1) to get the first day of the current month.
If you subtract datetime.timedelta(days=1), you get the last day of the previous month.
You can then use replace(day=1) again to get the first day of the previous month.
If you repeat this in a loop, you get the first and last day of all 12 months.
import datetime

current = datetime.datetime(2022, 5, 5)
start = current.replace(day=1)

for x in range(1, 13):
    end = start - datetime.timedelta(days=1)
    start = end.replace(day=1)
    print(f'{x:2} |', start.date(), '|', end.date())
Result:
1 | 2022-04-01 | 2022-04-30
2 | 2022-03-01 | 2022-03-31
3 | 2022-02-01 | 2022-02-28
4 | 2022-01-01 | 2022-01-31
5 | 2021-12-01 | 2021-12-31
6 | 2021-11-01 | 2021-11-30
7 | 2021-10-01 | 2021-10-31
8 | 2021-09-01 | 2021-09-30
9 | 2021-08-01 | 2021-08-31
10 | 2021-07-01 | 2021-07-31
11 | 2021-06-01 | 2021-06-30
12 | 2021-05-01 | 2021-05-31
EDIT:
If you use pandas, you can use pd.date_range(), but it doesn't count backwards from a date, so you would first have to work out the anchor dates yourself: '2021.04.05' (for freq='MS') and '2021.05.05' (for freq='M').
import pandas as pd
#all_starts = pd.date_range('2021.04.05', '2022.04.05', freq='MS')
all_starts = pd.date_range('2021.04.05', periods=12, freq='MS')
print(all_starts)
#all_ends = pd.date_range('2021.05.05', '2022.05.05', freq='M')
all_ends = pd.date_range('2021.05.05', periods=12, freq='M')
print(all_ends)
for start, end in zip(all_starts, all_ends):
    print(start.to_pydatetime().date(), '|', end.to_pydatetime().date())
DatetimeIndex(['2021-05-01', '2021-06-01', '2021-07-01', '2021-08-01',
'2021-09-01', '2021-10-01', '2021-11-01', '2021-12-01',
'2022-01-01', '2022-02-01', '2022-03-01', '2022-04-01'],
dtype='datetime64[ns]', freq='MS')
DatetimeIndex(['2021-05-31', '2021-06-30', '2021-07-31', '2021-08-31',
'2021-09-30', '2021-10-31', '2021-11-30', '2021-12-31',
'2022-01-31', '2022-02-28', '2022-03-31', '2022-04-30'],
dtype='datetime64[ns]', freq='M')
2021-05-01 | 2021-05-31
2021-06-01 | 2021-06-30
2021-07-01 | 2021-07-31
2021-08-01 | 2021-08-31
2021-09-01 | 2021-09-30
2021-10-01 | 2021-10-31
2021-11-01 | 2021-11-30
2021-12-01 | 2021-12-31
2022-01-01 | 2022-01-31
2022-02-01 | 2022-02-28
2022-03-01 | 2022-03-31
2022-04-01 | 2022-04-30
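As a side note, the anchors don't have to be hardcoded: pd.date_range() can also count backwards from an end date. A sketch deriving both ranges from the current date, assuming standard pandas offset semantics:
import pandas as pd

current = pd.Timestamp(2022, 5, 5)
# twelve month starts, ending with the start of the current month
all_starts = pd.date_range(end=current.replace(day=1), periods=12, freq='MS')
# MonthEnd(0) rolls each month start forward to that month's last day
all_ends = all_starts + pd.offsets.MonthEnd(0)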
EDIT:
I found out that the standard module calendar can give the number of days in a month. Note that calendar.monthrange(year, month) actually returns the weekday of the month's first day and the number of days (not weeks):
first_weekday, days = calendar.monthrange(year, month)
Working example:
import calendar

year = 2022
month = 5

for number in range(1, 13):
    if month > 1:
        month -= 1
    else:
        month = 12
        year -= 1
    first_weekday, days = calendar.monthrange(year, month)
    print(f'{number:2} | {year}.{month:02}.01 | {year}.{month:02}.{days}')
Result:
1 | 2022.04.01 | 2022.04.30
2 | 2022.03.01 | 2022.03.31
3 | 2022.02.01 | 2022.02.28
4 | 2022.01.01 | 2022.01.31
5 | 2021.12.01 | 2021.12.31
6 | 2021.11.01 | 2021.11.30
7 | 2021.10.01 | 2021.10.31
8 | 2021.09.01 | 2021.09.30
9 | 2021.08.01 | 2021.08.31
10 | 2021.07.01 | 2021.07.31
11 | 2021.06.01 | 2021.06.30
12 | 2021.05.01 | 2021.05.31
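The same monthrange() approach can also produce the list of (first, last) tuples from the expected output, in ascending order and as datetime.date objects; a minimal sketch:
import calendar
from datetime import date

current = date(2022, 5, 5)
year, month = current.year, current.month

months = []
for _ in range(12):
    # monthrange gives (weekday of first day, number of days in month)
    _, days = calendar.monthrange(year, month)
    months.append((date(year, month, 1), date(year, month, days)))
    month -= 1
    if month == 0:
        month = 12
        year -= 1

months.reverse()  # oldest month first, ending with the current month
print(months)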
So this is kinda on the crazy side of problems, so I apologize in advance... What I am trying to accomplish: read the oldest date from a CSV file, compare it to today's date, and if the difference between the two is 55 days or more, delete rows with pandas until the condition is met.
I have tried a few different ways using df.drop(); the closest I have come is the code below.
Here are the numbers from the testFile.csv I am using (everything in the CSV file is stored as strings):
2019-05-01 | 14
2019-05-02 | 16
2019-05-03 | 2
2019-05-04 | 3
2019-05-05 | 3
2019-05-06 | 6
2019-05-07 | 14
2019-05-08 | 8
2019-05-09 | 5
2019-05-10 | 1
2019-05-11 | 5
2019-05-12 | 4
2019-05-13 | 1
2019-05-14 | 2
2019-05-15 | 3
2019-05-16 | 8
2019-05-17 | 2
2019-05-18 | 3
2019-05-19 | 4
2019-05-20 | 4
import datetime, time
import pandas as pd

GLOBAL_PATH = r'C:\Users\DArthur\Documents'
pattern = '%Y-%m-%d'  # CSV pattern
el_pattern = '%m/%d/%Y:00:00:00'  # pattern required by Splunk for search_query, used for TimeStamp

def remove_old_data(csv_file):
    df = pd.read_csv(GLOBAL_PATH + csv_file, sep=',', index_col=0, encoding='utf-8', low_memory=False)
    s = pd.Series(pd.to_datetime('today') - pd.to_datetime(df.index[0])).dt.days  # calculate the date difference
    print(s[0], type(s[0]), type(s))  # Result -- 57 <class 'numpy.int64'> <class 'pandas.core.series.Series'>
    df[s.le(55)]  # .reset_index(drop=True).to_csv(csv_file, index=False)
    print(df)

if __name__ == '__main__':
    # get_last_date('/testFile.csv')
    remove_old_data('/testFile.csv')
Since the CSV file's oldest date is 57 days before today, the first two rows should be removed from the file; when the file is opened after the program runs, its first row should then start with 2019-05-03 | 2.
Any help or pointing in the right direction is greatly appreciated. :)
IIUC, use:
s = (pd.to_datetime('today') - pd.to_datetime(df.date)).dt.days
df[s.le(40)]  # .reset_index(drop=True).to_csv(file, index=False)
date count
3 2019-05-04 3
4 2019-05-05 3
5 2019-05-06 6
6 2019-05-07 14
7 2019-05-08 8
8 2019-05-09 5
9 2019-05-10 1
10 2019-05-11 5
11 2019-05-12 4
12 2019-05-13 1
13 2019-05-14 2
14 2019-05-15 3
15 2019-05-16 8
16 2019-05-17 2
17 2019-05-18 3
18 2019-05-19 4
19 2019-05-20 4
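Putting it together for the question's 55-day cutoff, a minimal sketch of a working remove_old_data (the column name date is an assumption based on the sample data):
import pandas as pd

def remove_old_data(csv_file, max_age_days=55):
    df = pd.read_csv(csv_file)
    # age of each row in whole days, relative to today
    age = (pd.to_datetime('today') - pd.to_datetime(df['date'])).dt.days
    # keep only rows newer than the cutoff and write the file back
    df[age.le(max_age_days)].to_csv(csv_file, index=False)

remove_old_data('testFile.csv')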
I want to get the count and sum of a column's values over a +/- 7 day window, after grouping the dataframe by another column.
Example data (edited to reflect my real dataset):
group | date | amount
-------------------------------------------
A | 2017-12-26 04:20:20 | 50000.0
A | 2018-01-17 00:54:15 | 60000.0
A | 2018-01-27 06:10:12 | 150000.0
A | 2018-02-01 01:15:06 | 100000.0
A | 2018-02-11 05:05:34 | 150000.0
A | 2018-03-01 11:20:04 | 150000.0
A | 2018-03-16 12:14:01 | 150000.0
A | 2018-03-23 05:15:07 | 150000.0
A | 2018-04-02 10:40:35 | 150000.0
Group by group, then sum amount where date-7 < date < date+7.
Results that I want:
group | date | amount | grouped_sum
-----------------------------------------------------------
A | 2017-12-26 04:00:00 | 50000.0 | 50000.0
A | 2018-01-17 00:00:00 | 60000.0 | 60000.0
A | 2018-01-27 06:00:00 | 150000.0 | 250000.0
A | 2018-02-01 01:00:00 | 100000.0 | 250000.0
A | 2018-02-11 05:05:00 | 150000.0 | 150000.0
A | 2018-03-01 11:00:04 | 150000.0 | 150000.0
A | 2018-03-16 12:00:01 | 150000.0 | 150000.0
A | 2018-03-23 05:00:07 | 100000.0 | 100000.0
A | 2018-04-02 10:00:00 | 100000.0 | 100000.0
Quick snippet to achieve the dataset:
group = 9 * ['A']
date = pd.to_datetime(['2017-12-26 04:20:20', '2018-01-17 00:54:15',
'2018-01-27 06:10:12', '2018-02-01 01:15:06',
'2018-02-11 05:05:34', '2018-03-01 11:20:04',
'2018-03-16 12:14:01', '2018-03-23 05:15:07',
'2018-04-02 10:40:35'])
amount = [50000.0, 60000.0, 150000.0, 100000.0, 150000.0,
150000.0, 150000.0, 150000.0, 150000.0]
df = pd.DataFrame({'group':group, 'date':date, 'amount':amount})
Bit of explanation (these numbers refer to the smaller toy example used in the answer below):
The 2nd row is 40 because it sums the data for A over 2018-01-14 and 2018-01-15.
The 4th row is 30 because it sums the data for B over 2018-02-03 plus the next 7 days.
The 6th row is 30 because it sums the data for B over 2018-02-05 plus the previous 7 days.
I don't have any idea how to sum over a date range. I might be able to do it this way:
1. Create another column that shows date-7 and date+7 for each row:
group | date | amount | date-7 | date+7
-------------------------------------------------------------
A | 2017-12-26 | 50000.0 | 2017-12-19 | 2018-01-02
A | 2018-01-17 | 60000.0 | 2018-01-10 | 2018-01-24
2. Calculate the amount within the date range: df[(df.group == 'A') & (df.date > df.date - 7) & (df.date < df.date + 7)].amount.sum()
3. But this method is quite tedious.
EDIT (2018-09-01):
I found the method below, based on @jezrael's answer; it works for me, but only for a single group:
t = pd.Timedelta(7, unit='d')

def g(row):
    res = df[(df.created > row.created - t) & (df.created < row.created + t)].amount.sum()
    return res

df['new'] = df.apply(g, axis=1)
The problem is that this needs a loop over each row within each group:
t = pd.Timedelta(7, unit='d')

def f(x):
    # in pandas >= 1.3, use inclusive='neither' instead of inclusive=False
    return x.apply(lambda y: x.loc[x['date'].between(y['date'] - t,
                                                     y['date'] + t,
                                                     inclusive=False), 'amount'].sum(), axis=1)

df['new'] = df.groupby('group', group_keys=False).apply(f)
print(df)
group date amount new
0 A 2018-01-01 10 10.0
1 A 2018-01-14 20 40.0
2 A 2018-01-15 20 40.0
3 B 2018-02-03 10 30.0
4 B 2018-02-04 10 30.0
5 B 2018-02-05 10 30.0
Thanks to @jpp for this improvement:
def f(x, t):
    return x.apply(lambda y: x.loc[x['date'].between(y['date'] - t,
                                                     y['date'] + t,
                                                     inclusive=False), 'amount'].sum(), axis=1)

df['new'] = df.groupby('group', group_keys=False).apply(f, pd.Timedelta(7, unit='d'))
Verify solution:
t = pd.Timedelta(7, unit='d')
df = df[df['group'] == 'A']

def test(y):
    a = df.loc[df['date'].between(y['date'] - t, y['date'] + t, inclusive=False)]
    print(a)
    print(a['amount'])
    return a['amount'].sum()
group date amount
0 A 2018-01-01 10
0 10
Name: amount, dtype: int64
group date amount
1 A 2018-01-14 20
2 A 2018-01-15 20
1 20
2 20
Name: amount, dtype: int64
group date amount
1 A 2018-01-14 20
2 A 2018-01-15 20
1 20
2 20
Name: amount, dtype: int64
df['new'] = df.apply(test,axis=1)
print (df)
group date amount new
0 A 2018-01-01 10 10
1 A 2018-01-14 20 40
2 A 2018-01-15 20 40
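If the groups are large, the row-by-row apply gets slow; here is a vectorized sketch of the same +/- 7 day window sum using a pairwise numpy mask (it builds an O(n²) matrix per group, so it only suits moderate group sizes):
import numpy as np
import pandas as pd

def window_sum(g, t=pd.Timedelta(7, unit='d')):
    d = g['date'].to_numpy()
    # pairwise |date_i - date_j| < t; strict inequality matches inclusive=False
    mask = np.abs(d[:, None] - d[None, :]) < t.to_timedelta64()
    # boolean matrix product sums 'amount' over each row's window
    return pd.Series(mask @ g['amount'].to_numpy(dtype=float), index=g.index)

df['new'] = df.groupby('group', group_keys=False).apply(window_sum)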
Add a column with the first day of each week:
df['week_start'] = df['date'].dt.to_period('W').apply(lambda x: x.start_time)
Result:
group date amount week_start
0 A 2018-01-01 10 2017-12-26
1 A 2018-01-14 20 2018-01-09
2 A 2018-01-15 20 2018-01-09
3 B 2018-02-03 10 2018-01-30
4 B 2018-02-04 10 2018-01-30
5 B 2018-02-05 10 2018-01-30
Group by new column and find weekly total amount:
grouped_sum = df.groupby('week_start')['amount'].sum().reset_index()
Result:
week_start amount
0 2017-12-26 10
1 2018-01-09 40
2 2018-01-30 30
Merge dataframes on week_start:
pd.merge(df.drop('amount', axis=1), grouped_sum, on='week_start').drop('week_start', axis=1)
Result:
group date amount
0 A 2018-01-01 10
1 A 2018-01-14 40
2 A 2018-01-15 40
3 B 2018-02-03 30
4 B 2018-02-04 30
5 B 2018-02-05 30
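Alternatively, the merge step can be skipped with transform, which broadcasts the weekly sum straight back onto the original rows; a sketch assuming date is already a datetime column:
df['grouped_sum'] = (df.groupby(['group', df['date'].dt.to_period('W')])['amount']
                       .transform('sum'))
Grouping on group as well keeps weekly sums separate per group, which matches the question more closely than a purely week-based grouping.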
How can I perform the below manipulation with pandas?
I have this dataframe :
weight | Date | dateDay
43 | 09/03/2018 08:48:48 | 09/03/2018
30 | 10/03/2018 23:28:48 | 10/03/2018
45 | 12/03/2018 04:21:44 | 12/03/2018
25 | 17/03/2018 00:23:32 | 17/03/2018
35 | 18/03/2018 04:49:01 | 18/03/2018
39 | 19/03/2018 20:14:37 | 19/03/2018
I want this :
weight | Date | dateDay | Fun_Cum
43 | 09/03/2018 08:48:48 | 09/03/2018 | NULL
30 | 10/03/2018 23:28:48 | 10/03/2018 | -13
45 | 12/03/2018 04:21:44 | 12/03/2018 | NULL
25 | 17/03/2018 00:23:32 | 17/03/2018 | NULL
35 | 18/03/2018 04:49:01 | 18/03/2018 | 10
39 | 19/03/2018 20:14:37 | 19/03/2018 | 4
Pseudo code:
If Day does not follow Day-1 => Fun_Cum is NULL;
Else (weight day) - (weight day-1)
Thank you
This is one way, using pd.Series.diff and pd.Series.shift. You can take the difference between consecutive datetime elements and access the pd.Series.dt.days attribute:
import numpy as np

df['Fun_Cum'] = df['weight'].diff()
df.loc[(df.dateDay - df.dateDay.shift()).dt.days != 1, 'Fun_Cum'] = np.nan
print(df)
weight Date dateDay Fun_Cum
0 43 2018-03-09 2018-03-09 NaN
1 30 2018-03-10 2018-03-10 -13.0
2 45 2018-03-12 2018-03-12 NaN
3 25 2018-03-17 2018-03-17 NaN
4 35 2018-03-18 2018-03-18 10.0
5 39 2018-03-19 2018-03-19 4.0
#import pandas as pd
#from datetime import datetime
#to_datetime = lambda d: datetime.strptime(d, '%d/%m/%Y')
#df = pd.read_csv('d.csv', converters={'dateDay': to_datetime})
The part above is only needed if you are reading from a file; otherwise .shift() is all you need:
a = df
b = df.shift()
df["Fun_Cum"] = (a.weight - b.weight) * ((a.dateDay - b.dateDay).dt.days ==1)
I have a data set that looks like this:
Date | ID | Task | Description
2016-01-06 00:00:00 | 1 | 010 | This is text
2016-01-06 00:10:00 | 1 | 020 | This is text
2016-01-06 00:20:00 | 1 | 010 | This is text
2016-01-06 01:00:00 | 1 | 020 | This is text
2016-01-06 01:10:00 | 1 | 030 | This is text
2016-02-06 00:00:00 | 2 | 010 | This is text
2016-02-06 00:10:00 | 2 | 020 | This is text
2016-02-06 00:20:00 | 2 | 010 | This is text
2016-02-06 01:00:00 | 2 | 020 | This is text
2016-02-06 01:01:00 | 2 | 030 | This is text
Task 020 usually occurs after task 010, which means that when task 020 starts, task 010 ends. The same applies to task 020 itself: if any other task comes after it, it means task 020 has stopped.
I need to group by Task calculating the average duration, total sum and count of each type of task in each ID, so I am looking for something like this:
ID | Task | Average | Sum | Count
1 | 010 | 25 | 50 | 2
1 | 020 | 10 | 20 | 2
etc | etc | etc | etc | etc
There are more task codes, but I only care about 010 and 020, so whatever is returned for the others is acceptable.
Can someone help me on how to do this in Python?
I think a simple .groupby() is what you need. Your sample output doesn't show any complicated linking between timestamps and Task or ID:
counts = df.groupby(['ID', 'Task']).size()
will give you the count of each unique ID/Task in your data. To do a sum or average it's similar, but you need a column with something to sum.
See the pandas groupby documentation for more details.
It seems you need agg with groupby, but the sample has no numeric column, so col was added:
print (df)
Date ID Task Description col
0 2016-01-06 00:00:00 1 010 This is text 1
1 2016-01-06 00:10:00 1 020 This is text 2
2 2016-01-06 00:20:00 1 010 This is text 6
3 2016-01-06 01:00:00 1 020 This is text 1
4 2016-01-06 01:10:00 1 030 This is text 3
5 2016-02-06 00:00:00 2 010 This is text 1
6 2016-02-06 00:10:00 2 020 This is text 8
7 2016-02-06 00:20:00 2 010 This is text 9
8 2016-02-06 01:00:00 2 020 This is text 1
df = df.groupby(['ID','Task'])['col'].agg(['sum','size', 'mean']).reset_index()
print (df)
ID Task sum size mean
0 1 010 7 2 3.5
1 1 020 3 2 1.5
2 1 030 3 1 3.0
3 2 010 10 2 5.0
4 2 020 9 2 4.5
If you need to aggregate the datetimes themselves, it is a bit complicated, because you first need numeric values (here epoch seconds):
import numpy as np

# datetimes as integer epoch seconds so they can be summed and averaged
df.Date = pd.to_datetime(df.Date).astype(np.int64) // 10**9
df = (df.groupby(['ID', 'Task'])['Date']
        .agg(['sum', 'size', 'mean'])
        .astype(np.int64)
        .reset_index())
# note: to_timedelta interprets these bare integers as nanoseconds
df['sum'] = pd.to_timedelta(df['sum'])
df['mean'] = pd.to_timedelta(df['mean'])
print (df)
ID Task sum size mean
0 1 010 00:00:02.904078 2 00:00:01.452039
1 1 020 00:00:02.904081 2 00:00:01.452040
2 1 030 00:00:01.452042 1 00:00:01.452042
3 2 010 00:00:02.909434 2 00:00:01.454717
4 2 020 00:00:02.909437 2 00:00:01.454718
For finding the difference in the Date column:
print (df.Date.dtypes)
object
#if dtype of column is not datetime, first convert
df.Date = pd.to_datetime(df.Date )
print (df.Date.diff())
0 NaT
1 0 days 00:10:00
2 0 days 00:10:00
3 0 days 00:40:00
4 0 days 00:10:00
5 30 days 22:50:00
6 0 days 00:10:00
7 0 days 00:10:00
8 0 days 00:40:00
9 0 days 00:01:00
Name: Date, dtype: timedelta64[ns]
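Building on that diff, here is a sketch of the aggregation the question asks for: take the time until the next row within the same ID as a task's duration (this assumes rows are sorted by Date within each ID, and the last task of an ID gets no duration), then aggregate per ID/Task in minutes:
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
# duration of each task = time until the next event within the same ID
df['duration'] = df.groupby('ID')['Date'].diff(-1).abs().dt.total_seconds() / 60

out = (df.groupby(['ID', 'Task'])['duration']
         .agg(Average='mean', Sum='sum', Count='count')
         .reset_index())
print(out)
On the sample data this gives, for ID 1, task 010 an average of 25, a sum of 50 and a count of 2, matching the expected output.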