Get N Rows from Value in Pandas - python

I have a list of monthly sales numbers for events. I have a column Event_Ind that indicates whether that month had an event. I need to get the 3 values (inclusive) prior to each event. Values are allowed to overlap.
import pandas as pd
dates = pd.date_range(start='2019-01-01', end='2020-01-01', freq='M')
values = [1000,1067,1099,1100,2000,1000,1057,1082,1200,1300,1453,1500]
event_ind = ["*","","","","*","","","","*","","*",""]
df = pd.DataFrame({'Dates':dates, 'Values':values, 'Event_Ind':event_ind})
Dates Values Event_Ind
0 2019-01-31 1000 *
1 2019-02-28 1067
2 2019-03-31 1099
3 2019-04-30 1100
4 2019-05-31 2000 *
5 2019-06-30 1000
6 2019-07-31 1057
7 2019-08-31 1082
8 2019-09-30 1200 *
9 2019-10-31 1300
10 2019-11-30 1453 *
11 2019-12-31 1500
Goal would be for this sample data:
Dates Values Event_Ind
0 1/31/2019 1000 *
1 3/31/2019 1099
2 4/30/2019 1100
3 5/31/2019 2000 *
4 7/31/2019 1057
5 8/31/2019 1082
6 9/30/2019 1200 *
7 9/30/2019 1200 *
8 10/31/2019 1300
9 11/30/2019 1453 *
I'm thinking I can do something with shift() or groupby.tail(), but I can't seem to use them to get my desired output.

You could do something along these lines:
import numpy as np

s = df.Event_Ind.eq('*')
i = np.concatenate([np.arange(a, b + 1) for b, a in zip(s[s].index, s[s].index - 2)])
df.loc[i[i >= 0]]
Dates Values Event_Ind
0 2019-01-31 1000 *
2 2019-03-31 1099
3 2019-04-30 1100
4 2019-05-31 2000 *
6 2019-07-31 1057
7 2019-08-31 1082
8 2019-09-30 1200 *
8 2019-09-30 1200 *
9 2019-10-31 1300
10 2019-11-30 1453 *
Explanation
[np.arange(a, b + 1) for b, a in zip(s[s].index, s[s].index - 2)]
The above zips the index of each row marked with * with the index two rows above it, so np.arange(a, b + 1) yields the indexes of the rows you want in the final df.
Since that generates a list of arrays, you np.concatenate them all into a single array of indexes to keep.
df.loc[i[i>=0]]
Finally, the above filters out the negative values in i (negative indexes have a meaning in Python, so they would otherwise select the wrong rows) and passes the result to df.loc[] to retrieve the final df.
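If the window size ever needs to change, the same idea can be wrapped in a small helper. This is just a sketch using positional indices (so it also works when the index is not the default RangeIndex); the function name and signature are my own:
import numpy as np

def rows_before_events(df, n=3):
    """Return the n rows up to and including each event; overlapping windows keep duplicates."""
    event_pos = np.flatnonzero(df['Event_Ind'].eq('*').to_numpy())
    pos = np.concatenate([np.arange(p - n + 1, p + 1) for p in event_pos])
    return df.iloc[pos[pos >= 0]]

print(rows_before_events(df, n=3))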

Try:
x=df["Event_Ind"]=="*"
ind=list(map(lambda i: any(x[i:i+3]), range(len(x))))
print(df.loc[ind])
Output:
Dates Values Event_Ind
0 2019-01-31 1000 *
2 2019-03-31 1099
3 2019-04-30 1100
4 2019-05-31 2000 *
6 2019-07-31 1057
7 2019-08-31 1082
8 2019-09-30 1200 *
9 2019-10-31 1300
10 2019-11-30 1453 *
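For larger frames, the same window test can be done without the Python-level loop. A sketch using a reversed rolling window (like the answer above, it drops duplicate rows for overlapping events):
# Mark event rows, then check whether an event occurs in this row or the next two:
# reverse, take a 3-row rolling max, reverse back.
is_event = df["Event_Ind"].eq("*").astype(int)
keep = is_event[::-1].rolling(3, min_periods=1).max()[::-1].astype(bool)
print(df[keep])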

Related

Python Pandas finds cumulative Min per group

I tried to find the Desired_Output column, which is defined as follows: for each Name and Subj group, find the minimum of all previous Scores.
Name Date Subj Score Desired_Output
A 2022-05-11 1200 70.88 69.60
A 2022-03-20 1200 69.96 69.60
A 2022-02-23 1200 69.60 69.63
A 2022-01-26 1200 69.63 70.22
A 2022-01-05 1200 70.35 70.22
A 2021-12-08 1200 70.22 70.69
A 2021-11-17 1000 56.73 null
A 2021-11-10 1200 70.69 null
B 2022-05-07 1600 96.16 94.53
B 2022-04-24 1600 94.53 null
B 2022-03-20 2000 124.60 null
B 2022-02-27 1800 109.16 null
B 2022-02-03 1400 82.54 null
Here is the dataset:
pd.DataFrame({
    'Name': ['A','A','A','A','A','A','A','A','B','B','B','B','B'],
    'Date': ['2022-05-11','2022-03-20','2022-02-23','2022-01-26','2022-01-05','2021-12-08','2021-11-17','2021-11-10','2022-05-07','2022-04-24','2022-03-20','2022-02-27','2022-02-03'],
    'Subj': [1200,1200,1200,1200,1200,1200,1000,1200,1600,1600,2000,1800,1400],
    'Score': [70.88,69.96,69.6,69.63,70.35,70.22,56.73,70.69,96.16,94.53,124.6,109.16,82.54]})
I don't know how to achieve that in Pandas, especially without looping the DataFrame.
Assuming the dates are sorted in reverse order, you can use a reversed cummin+shift per group:
df['Desired'] = (df[::-1]
    .groupby(['Name', 'Subj'])['Score']
    .apply(lambda s: s.cummin().shift())
)
Output:
Name Date Subj Score Desired
0 A 2022-05-11 1200 70.88 69.60
1 A 2022-03-20 1200 69.96 69.60
2 A 2022-02-23 1200 69.60 69.63
3 A 2022-01-26 1200 69.63 70.22
4 A 2022-01-05 1200 70.35 70.22
5 A 2021-12-08 1200 70.22 70.69
6 A 2021-11-17 1000 56.73 NaN
7 A 2021-11-10 1200 70.69 NaN
8 B 2022-05-07 1600 96.16 94.53
9 B 2022-04-24 1600 94.53 NaN
10 B 2022-03-20 2000 124.60 NaN
11 B 2022-02-27 1800 109.16 NaN
12 B 2022-02-03 1400 82.54 NaN
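If you prefer to avoid the lambda inside apply, the same reversed cummin+shift can be written with two plain groupby operations (a sketch, assuming the same newest-first sorting as above):
rev = df[::-1]                                                 # oldest rows first within each group
running_min = rev.groupby(['Name', 'Subj'])['Score'].cummin()  # running min including the current row
df['Desired'] = running_min.groupby([rev['Name'], rev['Subj']]).shift()  # drop the current row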

Join two dataframes based on non-matching date and time using pandas and nearest-method

I have edited the question based on @user17242583's comments and added more data and the output I get using just the example data.
I have two csv-files that I have read into two DataFrames, df1 and df2.
df1 (11 rows):
ID DateTime Event_no
0 1 01/01/2019 0:02 1003
1 2 01/01/2019 0:28 1015
2 3 01/01/2019 1:43 1029
3 4 01/01/2019 2:12 1042
4 5 01/01/2019 2:58 1055
5 6 01/01/2019 11:02 1068
6 7 01/01/2019 12:02 1081
7 8 01/01/2019 13:46 1094
8 9 01/01/2019 21:02 1107
9 10 01/01/2019 21:32 1120
10 11 01/01/2019 23:37 1133
df2 (14 rows):
ID lat lon DateTime
0 15 50.7823 90.564000 01/01/2019 0:03
1 16 51.4852 90.473600 01/01/2019 0:29
2 17 50.2981 90.387600 01/01/2019 3:53
3 18 50.3710 90.298667 01/01/2019 4:03
4 19 50.1289 90.210467 01/01/2019 5:03
5 20 49.8868 90.122267 01/01/2019 11:03
6 21 49.6447 90.034067 01/01/2019 13:03
7 22 49.4026 89.945867 01/01/2019 15:03
8 23 49.1605 89.857667 01/01/2019 19:03
9 24 48.9184 89.769467 01/01/2019 21:03
10 25 48.6763 89.681267 01/01/2019 22:03
11 26 48.4342 89.593067 01/01/2019 22:23
12 27 48.1921 89.504867 01/01/2019 23:03
13 28 47.9500 89.416667 01/01/2019 23:43
I need to join these two DataFrames based on the nearest date and time so that the joined Dataframe looks like this and meets these conditions:
df_join (11 rows)
all the events need to be joined with one location
one location can be joined with multiple events
some locations don't have an event to join to:
ID lat lon DateTime Event_no
0 15 50.7823 90.564000 01/01/2019 0:03 1003
1 16 51.4852 90.473600 01/01/2019 0:29 1015
2 16 51.4852 90.473600 01/01/2019 0:29 1029
3 17 50.2981 90.387600 01/01/2019 3:53 1042
4 17 50.2981 90.387600 01/01/2019 3:53 1055
5 20 49.8868 90.122267 01/01/2019 11:03 1068
6 20 49.8868 90.122267 01/01/2019 11:03 1081
7 21 49.6447 90.034067 01/01/2019 13:03 1094
8 24 48.9184 89.769467 01/01/2019 21:03 1107
9 25 48.6763 89.681267 01/01/2019 22:03 1120
10 28 47.9500 89.416667 01/01/2019 23:43 1133
Following @jezrael's answer here I've written the following code:
import pandas as pd
df1 = pd.read_csv("path/filename1.csv")
df2 = pd.read_csv("path/filename2.csv")
df1['DateTime'] = pd.to_datetime(df1.DateTime)
df2['DateTime'] = pd.to_datetime(df2.DateTime)
df1.sort_values('DateTime', inplace=True)
df2.sort_values('DateTime', inplace=True)
df1_join = df1.set_index('DateTime').reindex(df2.set_index('DateTime').index, method='nearest').reset_index()
df1_merge = (pd.merge(df2, df1_join, on='DateTime'))
df1_merge.to_csv("path/filename_join.csv")
The code runs through just fine but doesn't give me the results I need.
print(df1_join)
DateTime ID Event_no
0 2019-01-01 00:03:00 1 1003
1 2019-01-01 00:29:00 2 1015
2 2019-01-01 03:53:00 5 1055
3 2019-01-01 04:03:00 5 1055
4 2019-01-01 05:03:00 5 1055
5 2019-01-01 11:03:00 6 1068
6 2019-01-01 13:03:00 8 1094
7 2019-01-01 15:03:00 8 1094
8 2019-01-01 19:03:00 9 1107
9 2019-01-01 21:03:00 9 1107
10 2019-01-01 22:03:00 10 1120
11 2019-01-01 22:23:00 10 1120
12 2019-01-01 23:03:00 11 1133
13 2019-01-01 23:43:00 11 1133
It doesn’t join each event to one location (events 1029, 1042 and 1081 missing)
The code allows one event to be joined with multiple locations (1055, 1094, 1107, 1120 and 1133)
Any advice on how to edit the code so that the conditions above are met?
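One direction worth trying (a sketch, not tested against the full files) is pd.merge_asof with direction='nearest', keyed on the event table so every event gets exactly one nearest location and a location can be reused:
import pandas as pd

df1['DateTime'] = pd.to_datetime(df1['DateTime'])
df2['DateTime'] = pd.to_datetime(df2['DateTime'])

# merge_asof requires both frames to be sorted on the key
df1 = df1.sort_values('DateTime')
df2 = df2.sort_values('DateTime')

# df1 (events) on the left: each event is matched to its nearest location;
# locations without a nearby event are simply dropped.
df_join = pd.merge_asof(df1, df2, on='DateTime', direction='nearest',
                        suffixes=('_event', '_loc'))
print(df_join)
Note that the result keeps the event's timestamp; if the location's timestamp is needed in the output, copy df2['DateTime'] into a separate column before the merge.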

Index match equivalent in Python

I have a large dataset I'm trying to manipulate for further analysis. Below is what the relevant parts of the dataframe would look like.
Loan Closing Balance Date
1 175,000 2010-10-31
1 150,000 2010-11-30
1 125,000 2010-12-31
2 275,000 2010-10-31
2 250,000 2010-11-30
2 225,000 2010-12-31
3 375,000 2010-10-31
3 350,000 2010-11-30
3 320,000 2010-12-31
I would like to create a new column called Opening Balance which is basically the Closing Balance for the previous month's month end, so for the second row, Opening Balance would just be equal to 175,000, which is the Closing Balance for the first row.
As the dataset starts on 2010-10-31, I won't be able to look up a balance for 2010-09-30, so for any row with a date of 2010-10-31 I want to make the Opening Balance for that observation equal to the Closing Balance.
Here's what it should look like:
Loan Closing Balance Date Opening Balance
1 175,000 2010-10-31 175,000
1 150,000 2010-11-30 175,000
1 125,000 2010-12-31 150,000
2 275,000 2010-10-31 275,000
2 250,000 2010-11-30 275,000
2 225,000 2010-12-31 250,000
3 375,000 2010-10-31 375,000
3 350,000 2010-11-30 375,000
3 320,000 2010-12-31 350,000
In Excel I would normally do a compound index match with an eomonth function thrown in to do this but not quite sure how to do this in Python (still very new to it).
Any help appreciated.
I've tried the approach suggested by Santhosh and end up getting the following:
Closing Balance_x Date_x Closing Balance_y
0 175000 2010-09-30 150000.0
1 175000 2010-09-30 250000.0
2 175000 2010-09-30 350000.0
3 150000 2010-10-31 125000.0
4 150000 2010-10-31 225000.0
5 150000 2010-10-31 320000.0
6 125000 2010-11-30 NaN
7 275000 2010-09-30 150000.0
8 275000 2010-09-30 250000.0
9 275000 2010-09-30 350000.0
10 250000 2010-10-31 125000.0
11 250000 2010-10-31 225000.0
12 250000 2010-10-31 320000.0
13 225000 2010-11-30 NaN
14 375000 2010-09-30 150000.0
15 375000 2010-09-30 250000.0
16 375000 2010-09-30 350000.0
17 350000 2010-10-31 125000.0
18 350000 2010-10-31 225000.0
19 350000 2010-10-31 320000.0
20 320000 2010-11-30 NaN
I then amended that code to do a merge based off of the Loan ID and Date/pDate:
final_df = pd.merge(df, df, how="left", left_on=['Date'], right_on=['pDate'])
Loan Closing Balance_x Date_x Opening Balance
0 1 175000 2010-09-30 150000.0
1 1 150000 2010-10-31 125000.0
2 1 125000 2010-11-30 NaN
3 2 275000 2010-09-30 250000.0
4 2 250000 2010-10-31 225000.0
5 2 225000 2010-11-30 NaN
6 3 375000 2010-09-30 350000.0
7 3 350000 2010-10-31 320000.0
8 3 320000 2010-11-30 NaN
Now in this case I'm not sure why I get NaN on every November observation. The Opening Balance for Loan 1 in November should be 150,000. The October Opening Balance should be 175,000. And the September Opening Balance should just default to the September Closing Balance since I do not have an August Closing Balance to refer to.
Update
Think I resolved the issue, I changed the merge code to:
final_df = pd.merge(df, df, how="left", left_on=['Loan','pDate'], right_on=['Loan','Date'])
This still gets me NaN for September observations but that is fine as I can do a manual replace of those values.
I suggest you have another column that says Date - (1month) and then join them on the date fields to get opening balance.
df["cmonth"] = df.Date.apply(lambda x: x.year*100+x.month)
df["pDate"] = df.Date.apply(lambda x: (x - pd.DateOffset(months=1)))
df["pmonth"] = df.pDate.apply(lambda x: x.year*100+x.month)
final_df = pd.merge(df, df, how="left", left_on="cmonth", right_on="pmonth")
print(final_df[["close_x", "Date_x", "close_y"]])
#close_y is your opening balance
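For what it's worth, the whole lookup can also be done without a self-merge, using a per-loan shift (a sketch, assuming the columns are literally named Loan, Date and Closing Balance, and the rows are sorted by Loan and Date):
df = df.sort_values(['Loan', 'Date'])
# previous month's closing balance per loan, falling back to the current
# closing balance for the first month of each loan
df['Opening Balance'] = (df.groupby('Loan')['Closing Balance']
                           .shift()
                           .fillna(df['Closing Balance']))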

Grouping daily data by month in python/pandas while firstly grouping by user id

I have the table below in a Pandas dataframe:
date user_id whole_cost cost1
02/10/2012 00:00:00 1 1790 12
07/10/2012 00:00:00 1 364 15
30/01/2013 00:00:00 1 280 10
02/02/2013 00:00:00 1 259 24
05/03/2013 00:00:00 1 201 39
02/10/2012 00:00:00 3 623 1
07/12/2012 00:00:00 3 90 0
30/01/2013 00:00:00 3 312 90
02/02/2013 00:00:00 5 359 45
05/03/2013 00:00:00 5 301 34
02/02/2013 00:00:00 5 359 1
05/03/2013 00:00:00 5 801 12
..
The table was extracted from a csv file using the following code:
import pandas as pd
newnames = ['date','user_id', 'whole_cost', 'cost1']
df = pd.read_csv('expenses.csv', names = newnames, index_col = 'date')
I have to analyse the profile of my users and for this purpose:
I would like to group queries by month for each user (there are thousands of users), summing whole_cost over the entire month. For example, if user_id=1 has a whole_cost of 1790 (cost1 12) on 02/10/2012 and a whole_cost of 364 on 07/10/2012, then the new table should have a single entry of 2154 (the total whole_cost) on 31/10/2012. All dates in the transformed table will be month ends, each representing the whole month to which it relates.
In 0.14 you'll be able to groupby monthly and another column at the same time:
In [11]: df
Out[11]:
user_id whole_cost cost1
2012-10-02 1 1790 12
2012-10-07 1 364 15
2013-01-30 1 280 10
2013-02-02 1 259 24
2013-03-05 1 201 39
2012-10-02 3 623 1
2012-12-07 3 90 0
2013-01-30 3 312 90
2013-02-02 5 359 45
2013-03-05 5 301 34
2013-02-02 5 359 1
2013-03-05 5 801 12
In [12]: df1 = df.sort_index() # requires sorted DatetimeIndex
In [13]: df1.groupby([pd.TimeGrouper(freq='M'), 'user_id'])['whole_cost'].sum()
Out[13]:
user_id
2012-10-31 1 2154
3 623
2012-12-31 3 90
2013-01-31 1 280
3 312
2013-02-28 1 259
5 718
2013-03-31 1 201
5 1102
Name: whole_cost, dtype: int64
until 0.14 I think you're stuck with doing two groupbys:
In [14]: g = df.groupby('user_id')['whole_cost']
In [15]: g.resample('M', how='sum').dropna()
Out[15]:
user_id
1 2012-10-31 2154
2013-01-31 280
2013-02-28 259
2013-03-31 201
3 2012-10-31 623
2012-12-31 90
2013-01-31 312
5 2013-02-28 718
2013-03-31 1102
dtype: float64
With TimeGrouper deprecated, you can replace it with pd.Grouper to get the same results:
df.groupby(['user_id', pd.Grouper(key='date', freq='M')]).agg({'whole_cost': 'sum'})
Or, to group by day of week instead of month:
df.groupby(['user_id', df['date'].dt.dayofweek]).agg({'whole_cost': 'sum'})
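Putting it together with the CSV from the question, a minimal sketch (assuming the dates are day-first strings, hence parse_dates/dayfirst, and keeping 'date' as a regular column because pd.Grouper(key='date') needs a datetime column, not a string index):
import pandas as pd

newnames = ['date', 'user_id', 'whole_cost', 'cost1']
df = pd.read_csv('expenses.csv', names=newnames, parse_dates=['date'], dayfirst=True)

monthly = (df.groupby(['user_id', pd.Grouper(key='date', freq='M')])['whole_cost']
             .sum()
             .reset_index())
print(monthly)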

summing two columns in a pandas dataframe

when I use this syntax it creates a series rather than adding a column to my new dataframe sum.
My code:
sum = data['variance'] = data.budget + data.actual
My dataframe data currently has everything except the budget - actual column. How do I create a variance column?
cluster date budget actual budget - actual
0 a 2014-01-01 00:00:00 11000 10000 1000
1 a 2014-02-01 00:00:00 1200 1000
2 a 2014-03-01 00:00:00 200 100
3 b 2014-04-01 00:00:00 200 300
4 b 2014-05-01 00:00:00 400 450
5 c 2014-06-01 00:00:00 700 1000
6 c 2014-07-01 00:00:00 1200 1000
7 c 2014-08-01 00:00:00 200 100
8 c 2014-09-01 00:00:00 200 300
I think you've misunderstood some python syntax, the following does two assignments:
In [11]: a = b = 1
In [12]: a
Out[12]: 1
In [13]: b
Out[13]: 1
So in your code it was as if you were doing:
sum = df['budget'] + df['actual']  # a Series
# and
df['variance'] = df['budget'] + df['actual'] # assigned to a column
The latter creates a new column for df:
In [21]: df
Out[21]:
cluster date budget actual
0 a 2014-01-01 00:00:00 11000 10000
1 a 2014-02-01 00:00:00 1200 1000
2 a 2014-03-01 00:00:00 200 100
3 b 2014-04-01 00:00:00 200 300
4 b 2014-05-01 00:00:00 400 450
5 c 2014-06-01 00:00:00 700 1000
6 c 2014-07-01 00:00:00 1200 1000
7 c 2014-08-01 00:00:00 200 100
8 c 2014-09-01 00:00:00 200 300
In [22]: df['variance'] = df['budget'] + df['actual']
In [23]: df
Out[23]:
cluster date budget actual variance
0 a 2014-01-01 00:00:00 11000 10000 21000
1 a 2014-02-01 00:00:00 1200 1000 2200
2 a 2014-03-01 00:00:00 200 100 300
3 b 2014-04-01 00:00:00 200 300 500
4 b 2014-05-01 00:00:00 400 450 850
5 c 2014-06-01 00:00:00 700 1000 1700
6 c 2014-07-01 00:00:00 1200 1000 2200
7 c 2014-08-01 00:00:00 200 100 300
8 c 2014-09-01 00:00:00 200 300 500
As an aside, you shouldn't use sum as a variable name as this shadows the built-in sum function.
df['variance'] = df.loc[:,['budget','actual']].sum(axis=1)
You could also use the .add() function:
df.loc[:,'variance'] = df.loc[:,'budget'].add(df.loc[:,'actual'])
The same thing can be done using a lambda function.
Here I am reading the data from an xlsx file.
import pandas as pd
df = pd.read_excel("data.xlsx", sheet_name = 4)
print(df)
Output:
cluster Unnamed: 1 date budget actual
0 a 2014-01-01 00:00:00 11000 10000
1 a 2014-02-01 00:00:00 1200 1000
2 a 2014-03-01 00:00:00 200 100
3 b 2014-04-01 00:00:00 200 300
4 b 2014-05-01 00:00:00 400 450
5 c 2014-06-01 00:00:00 700 1000
6 c 2014-07-01 00:00:00 1200 1000
7 c 2014-08-01 00:00:00 200 100
8 c 2014-09-01 00:00:00 200 300
Sum the two columns into a third, new one.
df['variance'] = df.apply(lambda x: x['budget'] + x['actual'], axis=1)
print(df)
Output:
cluster Unnamed: 1 date budget actual variance
0 a 2014-01-01 00:00:00 11000 10000 21000
1 a 2014-02-01 00:00:00 1200 1000 2200
2 a 2014-03-01 00:00:00 200 100 300
3 b 2014-04-01 00:00:00 200 300 500
4 b 2014-05-01 00:00:00 400 450 850
5 c 2014-06-01 00:00:00 700 1000 1700
6 c 2014-07-01 00:00:00 1200 1000 2200
7 c 2014-08-01 00:00:00 200 100 300
8 c 2014-09-01 00:00:00 200 300 500
If "budget" has any NaN values but you don't want it to sum to NaN then try:
import math
import numpy as np

def fun(b, a):
    # fall back to the actual value when budget is NaN
    if math.isnan(b):
        return a
    else:
        return b + a

f = np.vectorize(fun, otypes=[float])
df['variance'] = f(df['budget'], df['actual'])
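The same NaN handling is also available without np.vectorize, via Series.add with fill_value (a sketch; it treats a missing value on either side as 0, which matches the function above when only budget is missing):
df['variance'] = df['budget'].add(df['actual'], fill_value=0)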
This is the most elegant solution; it follows DRY and works absolutely great.
dataframe_name[['col1', 'col2', 'col3']].sum(axis=1, skipna=True)
Thank you.
eval lets you sum and create columns right away:
In [12]: data.eval('variance = budget + actual', inplace=True)
In [13]: data
Out[13]:
cluster date budget actual variance
0 a 2014-01-01 00:00:00 11000 10000 21000
1 a 2014-02-01 00:00:00 1200 1000 2200
2 a 2014-03-01 00:00:00 200 100 300
3 b 2014-04-01 00:00:00 200 300 500
4 b 2014-05-01 00:00:00 400 450 850
5 c 2014-06-01 00:00:00 700 1000 1700
6 c 2014-07-01 00:00:00 1200 1000 2200
7 c 2014-08-01 00:00:00 200 100 300
8 c 2014-09-01 00:00:00 200 300 500
Since inplace=True you don't need to assign it back to data.
