I have a dataframe and I am struggling to create a column based on other columns. I will illustrate the problem with sample data.
Date Target1 Close
0 2018-05-25 198.0090 188.580002
1 2018-05-25 197.6835 188.580002
2 2018-05-25 198.0090 188.580002
3 2018-05-29 196.6230 187.899994
4 2018-05-29 196.9800 187.899994
5 2018-05-30 197.1375 187.500000
6 2018-05-30 196.6965 187.500000
7 2018-05-30 196.8750 187.500000
8 2018-05-31 196.2135 186.869995
9 2018-05-31 196.2135 186.869995
10 2018-05-31 196.5600 186.869995
11 2018-05-31 196.7700 186.869995
12 2018-05-31 196.9275 186.869995
13 2018-05-31 196.2135 186.869995
14 2018-05-31 196.2135 186.869995
15 2018-06-01 197.2845 190.240005
16 2018-06-01 197.2845 190.240005
17 2018-06-04 201.2325 191.830002
18 2018-06-04 201.4740 191.830002
I want to create another column for each observation (call it days_to_hit_target, for example) that holds the number of days until Close hits (or crosses) that row's target; that day count goes into the days_to_hit_target column.
The idea is: suppose the close price on 2018-05-25 is 188.58. I want to find the date on which the first row's target (198.0090) is hit by Close, which happens somewhere later, on 2018-06-04, where Close has reached the target of the first observation. The day difference between those two dates is what gets fed into days_to_hit_target for the first observation.
Use a combination of loc and at to find the date at which the target is hit, then subtract the dates.
df['TargetDate'] = pd.NaT
for i, row in df.iterrows():
    t = row['Target1']
    d = row['Date']
    # rows on a later date where Close has reached this row's target
    # (assumes df is sorted by Date, so the first match is the earliest hit)
    targdf = df.loc[(df['Close'] >= t) & (df['Date'] > d)]
    if len(targdf) > 0:
        df.at[i, 'TargetDate'] = targdf['Date'].iloc[0]

df['Diff'] = df['TargetDate'].sub(df['Date'], axis=0).dt.days
import pandas as pd

csv = pd.read_csv(
    'sample.csv',
    parse_dates=['Date']
)
csv.sort_values('Date', inplace=True)

def find_closest(row):
    target = row['Target1']
    date = row['Date']
    # rows strictly after this date where Close has reached the target
    matches = csv[
        (csv['Close'] >= target) &
        (csv['Date'] > date)
    ]
    closest_date = matches['Date'].iloc[0] if not matches.empty else None
    row['days to hit target'] = (closest_date - date).days if closest_date is not None else None
    return row

final = csv.apply(find_closest, axis=1)
It's a bit hard to test because none of the targets appear in the Close column of the sample. But the idea is simple: subset your original frame so that Date is after the current row's date and Close is greater than or equal to Target1, and take the first entry (this works because you've already sorted by date with df.sort_values).
If the subset is empty, use None. Otherwise, use the Date. Days to hit target is pretty simple at that point.
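To sanity-check the logic on data where a target actually gets hit, here's a minimal sketch with made-up numbers (the helper days_to_hit and the test frame are mine, not from the question):

import pandas as pd

# tiny synthetic frame where one target is actually reached (values are made up)
test = pd.DataFrame({
    'Date':    pd.to_datetime(['2018-05-25', '2018-05-29', '2018-06-04']),
    'Target1': [190.0, 300.0, 188.0],
    'Close':   [188.58, 187.90, 191.83],
}).sort_values('Date')

def days_to_hit(row, frame):
    # rows strictly after this date where Close has reached this row's target
    later = frame[(frame['Close'] >= row['Target1']) & (frame['Date'] > row['Date'])]
    return (later['Date'].iloc[0] - row['Date']).days if not later.empty else None

test['days to hit target'] = test.apply(days_to_hit, axis=1, frame=test)
# row 0: target 190.0 is first reached on 2018-06-04 -> 10 days
# rows 1 and 2: target never reached -> NaN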
I have a simple function which finds the third Friday of the month. I use this function to populate the dataframe for the third Fridays, and that part works fine.
The trouble I'm having is finding the third Friday for an expiration_date that doesn't fall on a third Friday.
This is my code simplified:
import pandas as pd
def is_third_friday(d):
    return d.weekday() == 4 and 15 <= d.day <= 21
x = ['09/23/2022','09/26/2022','09/28/2022','09/30/2022','10/3/2022','10/5/2022',
'10/7/2022','10/10/2022','10/12/2022','10/14/2022','10/17/2022','10/19/2022','10/21/2022',
'10/24/2022','10/26/2022','10/28/2022','11/4/2022','11/18/2022','12/16/2022','12/30/2022',
'01/20/2023','03/17/2023','03/31/2023','06/16/2023','06/30/2023','09/15/2023','12/15/2023',
'01/19/2024','06/21/2024','12/20/2024','01/17/2025']
df = pd.DataFrame(x)
df.rename( columns={0 :'expiration_date'}, inplace=True )
df['expiration_date']= pd.to_datetime(df['expiration_date'])
expiration_date = df['expiration_date']
df["is_third_friday"] = [is_third_friday(x) for x in expiration_date]
third_fridays = df.loc[df['is_third_friday'] == True]
df["current_monthly_exp"] = third_fridays['expiration_date'].min()
df["monthly_exp"] = third_fridays[['expiration_date']]
df.to_csv(path_or_buf = f'C:/Data/Date Dataframe.csv',index=False)
What I'm looking for: for any expiration_date that is prior to the monthly expiry, I want to populate the dataframe with that monthly expiry. If it's past the monthly expiry date, I want to populate the dataframe with the following monthly expiry.
I thought I'd be able to use a new dataframe with only the monthly expiries as a lookup table and do a timedelta, but when you look at 4/21/2023 and 7/21/2023, those dates don't exist in that dataframe.
This is my current output and this is the output I'm seeking (screenshots not reproduced here).
I was thinking I could handle this problem with something like:
date_df["monthly_exp"][0][::-1].expanding().min()[::-1]
But it wouldn't solve the 4/21/2023 and 7/21/2023 problem. Additionally, pandas won't let you do this on a datetime column.
>>> from numpy import nan
>>> df = pd.DataFrame([1, nan, 2, nan, nan, nan, 4])
>>> df
0
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 NaN
6 4.0
>>> df["b"] = df[0][::-1].expanding().min()[::-1]
>>> df
0 b
0 1.0 1.0
1 NaN 2.0
2 2.0 2.0
3 NaN 4.0
4 NaN 4.0
5 NaN 4.0
6 4.0 4.0
I've also tried something like the following in many different forms with little luck:
if df['is_third_friday'].any() == True:
    df["monthly_exp"] = third_fridays[['expiration_date']]
else:
    df["monthly_exp"] = third_fridays[['expiration_date']].shift(third_fridays)
Any suggestions to get me in the right direction would be appreciated. I've been stuck on this problem for some time.
You could add these additional lines of code (to replace df["monthly_exp"] = third_fridays[['expiration_date']]):
# DataFrame of fridays from minimum expiration_date to 30 days after last
fri_3s = pd.DataFrame(pd.date_range(df["expiration_date"].min(),
                                    df["expiration_date"].max() + pd.tseries.offsets.Day(30),
                                    freq="W-FRI"),
                      columns=["monthly_exp"])
# only keep those that are between 15th and 21st (as your function did)
fri_3s = fri_3s[fri_3s.monthly_exp.dt.day.between(15, 21)]
# merge_asof to get next third friday
df = pd.merge_asof(df, fri_3s, left_on="expiration_date", right_on="monthly_exp", direction="forward")
This creates a second DataFrame with the 3rd Fridays, and merging with merge_asof then returns, for each expiration_date, the next of these third Fridays.
And to simplify your date_df["monthly_exp"][0][::-1].expanding().min()[::-1] and use it for datetime, you could instead write df["monthly_exp"].bfill() (which backward fills). As you mentioned, this will only include Fridays that exist in your DataFrame already, so creating a list of the possible Fridays might be the easiest way.
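As a side note, pandas can also generate third Fridays directly with the week-of-month frequency string "WOM-3FRI", which makes the day-15-to-21 filter unnecessary. A small sketch of the same answer using that frequency (reusing df from above):

import pandas as pd

# third Fridays straight from pandas' week-of-month frequency
fri_3s = pd.DataFrame(pd.date_range(df["expiration_date"].min(),
                                    df["expiration_date"].max() + pd.tseries.offsets.Day(30),
                                    freq="WOM-3FRI"),
                      columns=["monthly_exp"])

# then the same merge_asof as above
df = pd.merge_asof(df, fri_3s, left_on="expiration_date",
                   right_on="monthly_exp", direction="forward")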
Consider this set of data:
data = [{'Year':'1959:01','0':138.89,'1':139.39,'2':139.74,'3':139.69,'4':140.68,'5':141.17},
{'Year':'1959:07','0':141.70,'1':141.90,'2':141.01,'3':140.47,'4':140.38,'5':139.95},
{'Year':'1960:01','0':139.98,'1':139.87,'2':139.75,'3':139.56,'4':139.61,'5':139.58}]
How can I convert to Pandas time series, like this:
Year Value
1959-01 138.89
1959-02 139.39
1959-03 139.74
...
1959-07 141.70
1959-08 141.90
...
Code
df = pd.DataFrame(data).set_index('Year').stack().droplevel(1)
df.index = pd.date_range(start=pd.to_datetime(df.index, format='%Y:%m')[0],
                         periods=len(df.index), freq='M').to_period('M')
df = df.to_frame().reset_index().rename(columns={'index': 'Year', (0):'Value'})
Explanation
Convert the df to a Series using stack, dropping the level that is not required.
Then rebuild the index as a date range of the required length; since we need the output at monthly frequency, convert it with to_period.
The last step is to convert the Series back to a frame and rename the columns.
Output as required
Year Value
0 1959-01 138.89
1 1959-02 139.39
2 1959-03 139.74
3 1959-04 139.69
4 1959-05 140.68
5 1959-06 141.17
6 1959-07 141.70
7 1959-08 141.90
8 1959-09 141.01
9 1959-10 140.47
10 1959-11 140.38
11 1959-12 139.95
12 1960-01 139.98
13 1960-02 139.87
14 1960-03 139.75
15 1960-04 139.56
16 1960-05 139.61
17 1960-06 139.58
Here is one way:
s = pd.DataFrame(data).set_index("Year").stack()
s.index = pd.Index([pd.to_datetime(start, format="%Y:%m") + pd.DateOffset(months=int(off))
                    for start, off in s.index], name="Year")
df = s.to_frame("Value")
First we set Year as the index and then stack the values next to it. Then we build a new index from the current one: the available dates plus the other level's values treated as month offsets. Lastly we go back to a frame, with the new column named Value.
to get
>>> df
Value
Year
1959-01-01 138.89
1959-02-01 139.39
1959-03-01 139.74
1959-04-01 139.69
1959-05-01 140.68
1959-06-01 141.17
1959-07-01 141.70
1959-08-01 141.90
1959-09-01 141.01
1959-10-01 140.47
1959-11-01 140.38
1959-12-01 139.95
1960-01-01 139.98
1960-02-01 139.87
1960-03-01 139.75
1960-04-01 139.56
1960-05-01 139.61
1960-06-01 139.58
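If you'd rather end up with the monthly 1959-01 style from the expected output instead of full timestamps, the same idea works with Period arithmetic (a small variant of the code above, not a separate method):

import pandas as pd

s = pd.DataFrame(data).set_index("Year").stack()
# turn each (start "year:month", column offset) pair into a monthly Period
s.index = pd.PeriodIndex(
    [pd.Period(pd.to_datetime(start, format="%Y:%m"), freq="M") + int(off)
     for start, off in s.index],
    name="Year",
)
df = s.to_frame("Value")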
As the title says, I want to know how to set every n-th value in a Python list to None. I looked for a solution in a lot of forums but didn't find much.
I also don't want to overwrite existing values with None; instead I want to insert new entries with the value None.
The list contains dates (12 dates = 1 year), and every 13th value should be empty because that row will hold the average, so I don't need a date there.
Here is the code I used to generate the dates with pandas:
import pandas as pd
numdays = 370  # I have 370 values, every value = 1 month, starting from 1990 till June 2019
date1 = '1990-01-01'
date2 = '2019-06-01'
mydates = pd.date_range(date1, date2,).tolist()
date_all = pd.date_range(start=date1, end=date2, freq='1BMS')
date_lst = [date_all]
The expected Output:
01.01.1990
01.02.1990
01.03.1990
01.04.1990
01.05.1990
01.06.1990
01.07.1990
01.08.1990
01.09.1990
01.10.1990
01.11.1990
01.12.1990
None
01.01.1991
.
.
.
If I understood correctly:
import pandas as pd
numdays = 370
date1 = '1990-01-01'
date2 = '2019-06-01'
mydates = pd.date_range(date1, date2,).tolist()
date_all = pd.date_range(start=date1, end=date2, freq='1BMS')
date_lst = [date_all]
for i in range(12, len(mydates), 13):  # add this
    mydates.insert(i, None)
I saw some of the answers above, but there's a way of doing this without having to loop over the complete list:
date_lst[12::12] = [None] * len(date_lst[12::12])
The first 12 in [12::12] means that the first item that should be changed is item number 12. The second 12 means that from then on every 12th item should be changed.
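For example, on a plain list of numbers (purely illustrative, not the question's dates), the slice assignment looks like this:

vals = list(range(1, 13))
vals[3::4] = [None] * len(vals[3::4])   # overwrite every 4th item, starting at index 3
print(vals)
# [1, 2, 3, None, 5, 6, 7, None, 9, 10, 11, None]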
You can add a step in iloc and set values this way.
Let's generate some dummy data:
import pandas as pd

df = pd.DataFrame({'Vals':
                   pd.date_range('01-01-19', '02-02-19', freq='D')})
print(df)
Vals
0 2019-01-01
1 2019-01-02
2 2019-01-03
3 2019-01-04
4 2019-01-05
5 2019-01-06
6 2019-01-07
7 2019-01-08
Now you can decide your step:
step = 5
new_df = df.iloc[step::step]
print(new_df)
Vals
5 2019-01-06
10 2019-01-11
15 2019-01-16
20 2019-01-21
25 2019-01-26
30 2019-01-31
Now, if you want to write a value to a specific column:
df.loc[df.index[step::step], 'Vals'] = pd.NaT  # plain .loc avoids the chained-assignment pitfall
print(df)
Vals
0 2019-01-01
1 2019-01-02
2 2019-01-03
3 2019-01-04
4 2019-01-05
5 NaT
Here is an example of setting None when the element of the list is in every 3rd position; you can make this every 13th position by changing the condition to ((index+1) % 13 == 0):
data = [1,2,3,4,5,6,7,8,9]
data = [None if ((index+1)%3 == 0) else d for index, d in enumerate(data)]
print(data)
output:
[1, 2, None, 4, 5, None, 7, 8, None]
Applied to your code, that would be:
date_lst = list(date_all)
dateWithNone = [None if ((index+1)%13 == 0) else d for index, d in enumerate(date_lst)]
print(dateWithNone)
I have to match a summary/subtotal table with a details table, while identifying non-matching items in the details table.
Since the time differences between detail records and subtotal records deviate significantly (+/-), the method I applied using pandas merge_asof() was not good enough: even though it matches based on a given timedelta value, it does not check whether the amounts are equal between the two tables.
Is there a way to code this so that, for each value in the subtotal table, subtotals are accumulated from the detail table, and once they match we move on to the next item in the subtotal table and start subtotaling again from the next item in the details table?
I'd appreciate it if someone can help with this issue.
import pandas as pd
import datetime as dt
subtotal = pd.DataFrame(data={'Date': ['21/09/2018 17:45:27', '21/09/2018 19:10:24', '21/09/2018 21:42:03'],
                              'Amount': [2000, 3000, 6000],
                              'Ref': [1, 2, 3]},
                        columns=['Date', 'Amount', 'Ref'])
detail = pd.DataFrame(data={'Date': ['21/09/2018 17:37:05', '21/09/2018 17:56:22', '21/09/2018 17:56:53',
                                     '21/09/2018 18:54:56', '21/09/2018 19:12:56', '21/09/2018 19:15:30',
                                     '21/09/2018 21:35:59', '21/09/2018 21:36:20', '21/09/2018 21:43:32'],
                            'Amount': [1000, 500, 500, 1000, 3000, 12000, 1000, 2000, 3000]},
                      columns=['Date', 'Amount'])
subtotal['Date'] = pd.to_datetime(subtotal['Date'])
detail['Date'] = pd.to_datetime(detail['Date'])
# Code i tried with pandas .merge_asof()
subtotal_sorted = subtotal.sort_values(by='Date')
detail_sorted = detail.sort_values(by='Date')
subtotal_sorted.index = subtotal_sorted['Date']
detail_sorted.index = detail_sorted['Date']
tol = pd.Timedelta('15 minute')
result = pd.merge_asof(left=detail_sorted,right=subtotal_sorted, right_index=True,left_index=True,direction='nearest',tolerance=tol)
"I'm expecting a resultant table similar to this." But just using pandas mergeasof() does not match the subtotal values. So i have to look into a another method.
Ref DateTime Value Result Ref_1 DateTime_1 Value_1
1 09/21/2018 17:37 1,000.00 Index1 1 09/21/2018 17:45 2000
2 09/21/2018 17:56 500.00 Index1 1 09/21/2018 17:45 2000
3 09/21/2018 17:56 500.00 Index1 1 09/21/2018 17:45 2000
4 09/21/2018 18:54 1,000.00 Index2 2 09/21/2018 19:10 3000
5 09/21/2018 19:12 2,000.00 Index2 2 09/21/2018 19:10 3000
6 09/21/2018 19:15 12,000.00 No Match
7 09/21/2018 21:35 1,000.00 Index3 3 09/21/2018 21:42 6000
8 09/21/2018 21:36 2,000.00 Index3 3 09/21/2018 21:42 6000
9 09/21/2018 21:43 3,000.00 Index3 3 09/21/2018 21:42 6000
I suspect this is one of those problems where it is not going to be easy to get an exact match on everything. Anyway I took a shot at it.
First let us define a function doing the merge. This is pretty much the same thing you did already, only with the addition of summing Amount_detail over all details matched to the same subtotal, keeping only the rows where the sums match.
def merge(subtotal, detail, tol):
    subtotal.sort_values(by='Date', inplace=True)
    detail.sort_values(by='Date', inplace=True)

    # We merge using merge_asof as before
    result = pd.merge_asof(left=detail, right=subtotal, on='Date',
                           direction='nearest', tolerance=tol)

    # We total Amount_detail over the matching Ref
    result['sum_amount_detail'] = result.groupby(['Ref'])['Amount_detail'].transform('sum')

    # If sum_amount_detail == Amount_subtotal we have a match!!
    match = result[result['sum_amount_detail'] == result['Amount_subtotal']]
    # Otherwise... no
    no_match = result[result['sum_amount_detail'] != result['Amount_subtotal']]

    detail_match = match[['Date', 'Amount_detail', 'Ref']].copy()
    detail_no_match = no_match[['Date', 'Amount_detail']].copy()
    subtotal_match = subtotal[subtotal['Ref'].isin(detail_match['Ref'].unique())].copy()
    subtotal_no_match = subtotal[~subtotal['Ref'].isin(detail_match['Ref'].unique())].copy()

    return detail_match, subtotal_match, detail_no_match, subtotal_no_match
Now, using this function with your original criteria (15 minute tolerance):
import pandas as pd
subtotal = pd.DataFrame(data={'Date': ['21/09/2018 17:45:27', '21/09/2018 19:10:24', '21/09/2018 21:42:03'],
                              'Amount_subtotal': [2000, 3000, 6000],
                              'Ref': [1, 2, 3]},
                        columns=['Date', 'Amount_subtotal', 'Ref'])
detail = pd.DataFrame(data={'Date': ['21/09/2018 17:37:05', '21/09/2018 17:56:22', '21/09/2018 17:56:53',
                                     '21/09/2018 18:54:56', '21/09/2018 19:12:56', '21/09/2018 19:15:30',
                                     '21/09/2018 21:35:59', '21/09/2018 21:36:20', '21/09/2018 21:43:32'],
                            'Amount_detail': [1000, 500, 500, 1000, 3000, 12000, 1000, 2000, 3000]},
                      columns=['Date', 'Amount_detail'])
subtotal['Date'] = pd.to_datetime(subtotal['Date'])
detail['Date'] = pd.to_datetime(detail['Date'])
tol = pd.Timedelta('15 minute')
detail_match, subtotal_match, detail_no_match, subtotal_no_match = merge(subtotal, detail, tol)
This works okay except for one obvious flaw: the subtotal at 2018-09-21 19:10:24 (3000) is not included. That is because another detail value also matched it, so the summed details overshoot the subtotal.
One workaround is to do the merge in a loop where we keep increasing the tolerance: that way we first pick up the closest matches, and then the matches further and further away. It's not pretty, but it works.
tolerances = [pd.Timedelta('5 minute'), pd.Timedelta('10 minute'), pd.Timedelta('15 minute')]

subtotal_no_match = subtotal.copy()
detail_no_match = detail.copy()
detail_list = []
subtotal_list = []

for tol in tolerances:
    detail_match, subtotal_match, detail_no_match, subtotal_no_match = merge(subtotal_no_match, detail_no_match, tol)
    if len(detail_match) > 0:
        detail_list.append(detail_match)
        subtotal_list.append(subtotal_match)
    if len(subtotal_no_match) == 0:
        # We have matched everything in subtotal
        break

detail_final = pd.concat(detail_list)
subtotal_final = pd.concat(subtotal_list)
detail_final
Out[5]:
Date Amount_detail Ref
4 2018-09-21 19:12:56 3000 2.0
5 2018-09-21 21:35:59 1000 3.0
6 2018-09-21 21:36:20 2000 3.0
7 2018-09-21 21:43:32 3000 3.0
0 2018-09-21 17:37:05 1000 1.0
1 2018-09-21 17:56:22 500 1.0
2 2018-09-21 17:56:53 500 1.0
subtotal_final
Out[6]:
Date Amount_subtotal Ref
1 2018-09-21 19:10:24 3000 2
2 2018-09-21 21:42:03 6000 3
0 2018-09-21 17:45:27 2000 1
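As a small follow-up (not part of the original answer): whatever is still left in detail_no_match and subtotal_no_match after the loop is exactly what could not be reconciled at any tolerance, so those are the rows to flag as "No Match" like in your expected table:

# rows that never reconciled at any tolerance
detail_no_match = detail_no_match.assign(Result='No Match')
print(detail_no_match)
print(subtotal_no_match)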
So I have a pandas dataframe which has a large number of columns, and one of the columns is a timestamp in datetime format. Each row in the dataframe represents a single "event". What I'm trying to do is graph the frequency of these events over time. Basically a simple bar graph showing how many events per month.
Started with this code:
data.groupby([(data.Timestamp.dt.year),(data.Timestamp.dt.month)]).count().plot(kind = 'bar')
plt.show()
This "kind of" works. But there are 2 problems:
1) The graph comes with a legend which includes all the columns in the original data (like 30+ columns). And each bar on the graph has a tiny sub-bar for each of the columns (all of which are the same value since I'm just counting events).
2) There are some months where there are zero events. And these months don't show up on the graph at all.
I finally came up with code to get the graph looking the way I wanted, but it seems to me that I'm not doing this the "correct" way, since this must be a fairly common use case.
Basically I created a new dataframe with one column "count" and an index that's a string representation of month/year. I populated that with zeroes over the time range I care about and then I copied over the data from the first frame into the new one. Here is the code:
import pandas as pd
import matplotlib.pyplot as plt

cnt = data.groupby([(data.Timestamp.dt.year), (data.Timestamp.dt.month)]).count()

index = []
for year in [2015, 2016, 2017, 2018]:
    for month in range(1, 13):
        index.append('%04d-%02d' % (year, month))

cnt_new = pd.DataFrame(index=index, columns=['count'])
cnt_new = cnt_new.fillna(0)
for i, row in cnt.iterrows():
    cnt_new.at['%04d-%02d' % i, 'count'] = row[0]

cnt_new.plot(kind='bar')
plt.show()
Anyone know an easier way to go about this?
EDIT --> Per request, here's an idea of the type of dataframe. It's the results from an SQL query. Actual data is my company's so...
Timestamp FirstName LastName HairColor \
0 2018-11-30 02:16:11 Fred Schwartz brown
1 2018-11-29 16:25:55 Sam Smith black
2 2018-11-19 21:12:29 Helen Hunt red
OK, so I think I got it. Thanks to Yuca for the resample command. I just need to run that on the Timestamp series (rather than on the whole dataframe) and it gives me exactly what I was looking for.
data.index = data.Timestamp
data.Timestamp.resample('M').count()
Timestamp
2017-11-30 0
2017-12-31 0
2018-01-31 1
2018-02-28 2
2018-03-31 7
2018-04-30 9
2018-05-31 2
2018-06-30 6
2018-07-31 5
2018-08-31 4
2018-09-30 1
2018-10-31 0
2018-11-30 5
So the OP's request is: "Basically a simple bar graph showing how many events per month".
Using resample with a monthly frequency yields the desired result:
df[['FirstName']].resample('M').count()
Output:
FirstName
Timestamp
2018-11-30 3
To include non-observed months, we need to create a baseline calendar:
df_a = pd.DataFrame(index = pd.date_range(df.index[0].date(), periods=12, freq='M'))
and then assign to it the result of our resample
df_a['count'] = df[['FirstName']].resample('M').count()
Output:
count
2018-11-30 3.0
2018-12-31 NaN
2019-01-31 NaN
2019-02-28 NaN
2019-03-31 NaN
2019-04-30 NaN
2019-05-31 NaN
2019-06-30 NaN
2019-07-31 NaN
2019-08-31 NaN
2019-09-30 NaN
2019-10-31 NaN
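For reference, a shorter route to the same monthly bar chart is the sketch below; it assumes data has a datetime column called Timestamp, as in the question. resample('M').size() produces every calendar month between the first and last event, so empty months show up as zero counts instead of disappearing:

import matplotlib.pyplot as plt

# count rows per calendar month, keeping months with zero events
monthly = data.set_index('Timestamp').resample('M').size()
monthly.plot(kind='bar')
plt.show()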