I have this DataFrame (called Datos below):
I applied df.groupby('site') to classify the data by this feature:
grouped = Datos.groupby('site')
After grouping, I want to complete the 'date' column day by day for all records.
The procedure I think I should follow is:
1. Generate a complete sequence between the start and end dates. (Step completed; a one-line alternative follows this list.)
import datetime

# the same sequence serves every site, so no per-site loop is needed here
dates = ['2018-01-01', '2020-01-17']
startDate = datetime.datetime.strptime(dates[0], "%Y-%m-%d")  # parse first date
endDate = datetime.datetime.strptime(dates[-1], "%Y-%m-%d")   # parse last date
days = (endDate - startDate).days  # how many days in between?
allDates = {datetime.datetime.strftime(startDate + datetime.timedelta(days=k), "%Y-%m-%d"): 0
            for k in range(days + 1)}
2. Compare this sequence with the 'date' column of each group and add the dates that are not already present.
3. Write a function or loop that updates the 'date' column with the new dates and fills the missing values with 0 (something like grouped.apply(add_days)).
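For reference, pd.date_range builds the same sequence as the step-1 code in one line:

import pandas as pd
# one-line equivalent of the dict comprehension in step 1
allDates = {d: 0 for d in pd.date_range(dates[0], dates[-1]).strftime("%Y-%m-%d")}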
So far I have only managed to complete step 1, so I ask for your help to complete steps 2 and 3.
I would very much appreciate your help.
Regards
I had to do almost the same thing for a project.
Maybe it's not the best solution for you, but it may help (and I hope it saves you the headache I had).
Here is how I managed it, with help from https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html:
import pandas as pd

df_DateRange = pd.DataFrame()
df_1 = pd.DataFrame()
# 1. Create a DataFrame with all days (your step 1):
dates_list = ['2019-12-31', '2020-01-05']
df_DateRange['date'] = pd.date_range(start=dates_list[0], end=dates_list[-1], freq='1D')
df_DateRange['date'] = df_DateRange['date'].dt.strftime('%Y-%m-%d')
df_DateRange.set_index(['date'], inplace=True)
# Set the index of your Datos DataFrame:
Datos.set_index(['date'], inplace=True)
# Join both DataFrames:
df_1 = df_DateRange.join(Datos)
# 2. Replace the NaN values:
df_1['site'].fillna("", inplace=True)
df_1['value'].fillna(0, inplace=True)
df_1['value2'].fillna(0, inplace=True)
# 3. Do the calculation:
grouped = df_1.groupby('site').sum()
df_DateRange:
date
0 2019-12-31
1 2020-01-01
2 2020-01-02
3 2020-01-03
4 2020-01-04
5 2020-01-05
Datos:
date site value value2
0 2020-01-01 site1 1 -1
1 2020-01-01 site2 2 -2
2 2020-01-02 site1 10 -10
3 2020-01-02 site2 20 -20
df_1:
site value value2
date
2019-12-31 0.0 0.0
2020-01-01 site1 1.0 -1.0
2020-01-01 site2 2.0 -2.0
2020-01-02 site1 10.0 -10.0
2020-01-02 site2 20.0 -20.0
2020-01-03 0.0 0.0
2020-01-04 0.0 0.0
2020-01-05 0.0 0.0
grouped:
value value2
site
0.0 0.0
site1 11.0 -11.0
site2 22.0 -22.0
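An alternative that stays closer to the grouped.apply(add_days) idea from the question is to reindex each group onto the full date range. A sketch, assuming Datos still has 'date' as a plain column (i.e. before the set_index above):

import pandas as pd

full_range = pd.date_range(dates_list[0], dates_list[-1], freq='1D').strftime('%Y-%m-%d')

def add_days(group):
    # reindex this site's rows onto every day; missing days get value 0
    return (group.set_index('date')
                 .reindex(full_range, fill_value=0)
                 .drop(columns='site'))

completed = Datos.groupby('site').apply(add_days)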
I have a dataframe (df) with a date index. And I want to achieve the following:
1. Take the Dates index and add one month -> e.g. nxt_dt = df.index + pd.DateOffset(months=1) (np.timedelta64 has no fixed-length month unit), and let's call df.index curr_dt.
2. Find the nearest entry in Dates that is >= nxt_dt.
3. Count the rows between curr_dt and nxt_dt and put the count into a column in df.
The result is supposed to look like this:
px_volume listed_sh ... iv_mid_6m '30d'
Dates ...
2005-01-03 228805 NaN ... 0.202625 21
2005-01-04 189983 NaN ... 0.203465 22
2005-01-05 224310 NaN ... 0.202455 23
2005-01-06 221988 NaN ... 0.202385 20
2005-01-07 322691 NaN ... 0.201065 21
Needless to say, there are only dates/rows in the df for which there are observations.
I can think of several ways to get this done with loops, but since the data I work with is quite big, I would really like to avoid looping through rows to fill them.
Is there a way in pandas to get this done vectorized?
If you are OK with reindexing, this should do the job:
import numpy as np
import pandas as pd
df = pd.DataFrame({'date': ['2020-01-01', '2020-01-08', '2020-01-24', '2020-01-29', '2020-02-09', '2020-03-04']})
df['date'] = pd.to_datetime(df['date'])
df['value'] = 1
df = df.set_index('date')
# fill the missing calendar days with value 0
df = df.reindex(pd.date_range('2020-01-01', '2020-03-04')).fillna(0)
# sort newest-first so a trailing 30-row (= 30-day) window looks forward in time
df = df.sort_index(ascending=False)
# subtract 1 to exclude each row from its own count
df['30d'] = df['value'].rolling(30).sum() - 1
df.sort_index().query("value == 1")
gives:
value 30d
2020-01-01 1.0 3.0
2020-01-08 1.0 2.0
2020-01-24 1.0 2.0
2020-01-29 1.0 1.0
2020-02-09 1.0 NaN
2020-03-04 1.0 NaN
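For an exact one-month window (rather than the 30-day approximation), np.searchsorted on the original observation index gives the counts without any reindexing. A sketch over the same dates:

import numpy as np
import pandas as pd

idx = pd.DatetimeIndex(['2020-01-01', '2020-01-08', '2020-01-24',
                        '2020-01-29', '2020-02-09', '2020-03-04'])
nxt = idx + pd.DateOffset(months=1)
# number of rows with a date in (curr_dt, nxt_dt] for each observation
counts = idx.searchsorted(nxt, side='right') - idx.searchsorted(idx, side='right')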
I would like to make an empty pandas series with a date index which is every day of 2020. That is 01-01-2020, 02-01-2020 etc.
Although this looks very simple, I couldn't find out how to do it.
Use date_range:
range_2020 = pd.date_range("2020-01-01", "2020-12-31", freq="D")
pd.DataFrame(range(366), index=range_2020)
The output is:
0
2020-01-01 0
2020-01-02 1
2020-01-03 2
2020-01-04 3
2020-01-05 4
...
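If you literally want an empty Series rather than a DataFrame, the same index works; the dtype is given explicitly so the empty Series doesn't default to object:

# an empty (all-NaN) float Series indexed by every day of 2020
s = pd.Series(index=range_2020, dtype="float64")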
I'm trying to return a count on a time window about a (moving) fixed point.
It's an attempt to understand the condition of an instrument at any time, as a function of usage prior to it.
So if the instrument is used at 12.05pm, 12.10, 12.15, 12.30, 12.40 and 1pm, the usage counts would be:
12.05 -> 1 (once in the last hour)
12.10 -> 2
12.15 -> 3
12.30 -> 4
12.40 -> 5
1.00 -> 6
... but then let's say usage resumes at 1.06:
1.06 -> 6
this doesn't increase the count, as the first run is over an hour ago.
How can I calculate this count and append it as a column?
It feels like this is a groupby/aggregate/count, possibly using timedeltas in a lambda function, but I don't know where to start beyond that.
I'd like to be able to play with the time window too: not just the past hour, but the hour surrounding an instance, i.e. +/-30 minutes.
The following code gives a starting dataframe:
s = pd.Series(pd.date_range('2020-1-1', periods=8000, freq='250s'))
df = pd.DataFrame({'Run time': s})
df_sample = df.sample(6000)
df_sample = df_sample.sort_index()
The best help I found (and to be fair, I can usually hack something together from the logic) was Distinct count on a rolling time window, but I've not managed it this time.
Thanks
I've done something similar previously with the DataFrame.rolling function:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html
So for your dataset, first you need to set the index to the datetime field; then you can perform the analysis you need. Continuing on from your code:
s = pd.Series(pd.date_range('2020-1-1', periods=8000, freq='250s'))
df = pd.DataFrame({'Run time': s})
df_sample = df.sample(6000)
df_sample = df_sample.sort_index()
# Create a value we can count
df_sample['Occurrences'] = 1
# Set the index to the datetime element
df_sample = df_sample.set_index('Run time')
# Use Pandas rolling method, 3600s = 1 Hour
df_sample['Occurrences in Last Hour'] = df_sample['Occurrences'].rolling('3600s').sum()
df_sample.head(15)
Occurrences Occurrences in Last Hour
Run time
2020-01-01 00:00:00 1 1.0
2020-01-01 00:04:10 1 2.0
2020-01-01 00:08:20 1 3.0
2020-01-01 00:12:30 1 4.0
2020-01-01 00:16:40 1 5.0
2020-01-01 00:25:00 1 6.0
2020-01-01 00:29:10 1 7.0
2020-01-01 00:37:30 1 8.0
2020-01-01 00:50:00 1 9.0
2020-01-01 00:54:10 1 10.0
2020-01-01 00:58:20 1 11.0
2020-01-01 01:02:30 1 11.0
2020-01-01 01:06:40 1 11.0
2020-01-01 01:15:00 1 10.0
2020-01-01 01:19:10 1 10.0
You need to set the index to a datetime element to utilise the time-based window; otherwise you can only use integer windows corresponding to a number of rows.
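For the surrounding-hour window (+/-30 minutes) mentioned in the question, rolling also accepts center=True with time-based windows. A sketch, assuming pandas 1.3 or later, where centered datetime windows are supported:

# hour-wide window centred on each event: 30 minutes back, 30 minutes forward
df_sample['Occurrences in Surrounding Hour'] = (
    df_sample['Occurrences'].rolling('3600s', center=True).sum()
)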
I just am unable to solve this without loops, and I have a pretty long timeseries. I want to know the closest next maturity date based on the information we know today. Example below; note that the next expiry date should be for that specific code. There has got to be a more pythonic way of doing this.
date matdate code
2-Jan-2018 5-Jan-2018 A
3-Jan-2018 6-Jan-2018 A
8-Jan-2018 12-Jan-2018 B
10-Jan-2018 15-Jan-2018 A
11-Jan-2018 16-Jan-2018 B
15-Jan-2018 17-Jan-2018 A
And I am looking for the output to be in the below format, which includes all weekday dates (this could also be in pivot format, but should have all weekday dates as the index):
date matdate code BusinessDaysToNextMat
2-Jan-2018 5-Jan-2018 A 3
2-Jan-2018 B 0
3-Jan-2018 8-Jan-2018 A 2
3-Jan-2018 B 0
4-Jan-2018 A 1
4-Jan-2018 B 0
5-Jan-2018 A 0
5-Jan-2018 B 0
8-Jan-2018 A 0
8-Jan-2018 17-Jan-2018 B 7
9-Jan-2018 A 0
9-Jan-2018 B 6
10-Jan-2018 16-Jan-2018 A 4
10-Jan-2018 B 6
11-Jan-2018 A 3
11-Jan-2018 16-Jan-2018 B 3
12-Jan-2018 A 4
12-Jan-2018 B 2
15-Jan-2018 17-Jan-2018 A 1
15-Jan-2018 B 1
Thank you very much for taking a look!
You can use numpy.busday_count to achieve that:
import numpy as np
df['BusinessDaysToNextMat'] = df[['date', 'matdate']].apply(lambda x: np.busday_count(*x), axis=1)
df
# date matdate code BusinessDaysToNextMat
#0 2018-01-01 2018-01-05 A 4
#1 2018-01-03 2018-01-06 A 3
#2 2018-01-08 2018-01-12 B 4
#3 2018-01-10 2018-01-15 A 3
#4 2018-01-11 2018-01-16 B 3
#5 2018-01-15 2018-01-17 A 2
#6 2018-01-20 2018-01-22 A 0
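As a follow-up, np.busday_count also accepts whole arrays, so the row-wise apply can be avoided. A sketch, assuming both columns hold parseable dates:

# vectorised version: cast to datetime64[D] and pass the columns directly
df['BusinessDaysToNextMat'] = np.busday_count(
    df['date'].values.astype('datetime64[D]'),
    df['matdate'].values.astype('datetime64[D]'))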
This isn't exactly the output in your example, but it does most of it:
index = pd.MultiIndex.from_product(
[pd.date_range(
df['date'].min(),
df['date'].max(), freq='C').values,
df['code'].unique()],
names = ['date', 'code'])
resampled = pd.DataFrame(index=index).reset_index().merge(df, on=['date', 'code'], how='left')
calc = resampled.dropna().copy()  # .copy() avoids a SettingWithCopyWarning
calc['BusinessDaysToNextMat'] = calc[['date', 'matdate']].apply(lambda x: np.busday_count(*x), axis=1)
final = resampled.merge(calc, on=['date', 'code', 'matdate'], how='left')
final['BusinessDaysToNextMat'].fillna(0, inplace=True)
final
# date code matdate BusinessDaysToNextMat
#0 2018-01-02 A 2018-01-05 3.0
#1 2018-01-02 B NaT 0.0
#2 2018-01-03 A 2018-01-06 3.0
#3 2018-01-03 B NaT 0.0
#4 2018-01-04 A NaT 0.0
#5 2018-01-04 B NaT 0.0
#6 2018-01-05 A NaT 0.0
#7 2018-01-05 B NaT 0.0
#8 2018-01-08 A NaT 0.0
#9 2018-01-08 B 2018-01-12 4.0
#10 2018-01-09 A NaT 0.0
#11 2018-01-09 B NaT 0.0
#12 2018-01-10 A 2018-01-15 3.0
#13 2018-01-10 B NaT 0.0
#14 2018-01-11 A NaT 0.0
#15 2018-01-11 B 2018-01-16 3.0
#16 2018-01-12 A NaT 0.0
#17 2018-01-12 B NaT 0.0
#18 2018-01-15 A 2018-01-17 2.0
#19 2018-01-15 B NaT 0.0
Here is what I am doing currently, which clearly isn't the most efficient:
# Step1: Make a new df with data of just one code and fill up any blank matdates with the very first available matdate. After that:
temp_df['newmatdate'] = datetime.date(2014,1,1) # create a temp column to hold the current minimum maturity date
temp_df['BusinessDaysToNextMat'] = 0 # this is the column that we are after
mindates = [] # initialise a list to hold any new maturity dates, kept min-sorted
mindates.append(dummy) # dummy is the very first available maturity date (as of the first date we only know one trade); the real code is longer and not relevant here
x = mindates[0] # create a variable to be used in the loop
g = datetime.datetime.now()
for i in range(len(temp_df['matdate'])): # loop through every date
if np.in1d(temp_df['matdate'][i],mindates)[0]==False: # if the current maturity date found DOES NOT exist in the list of mindates, add it
mindates.append(temp_df['matdate'][i])
while min(mindates)< temp_df['date'][i]: # if the current date is greater than the min mindate held so far,
mindates.sort() # sort it so you are sure to remove the min mindate
x = mindates[0] # note the date which you are dropping before dropping it
del mindates[0] # drop the curr min mindate, so the next mindate, becomes the new min mindate
if temp_df['matdate'][i] != x: # I think this might be redundant, but it is basically checking if the new matdate which you may be adding, wasn't the one
mindates.append(temp_df['matdate'][i]) # which you just removed, if not, add this new one to the list
curr_min = min(mindates)
temp_df['newmatdate'][i] = curr_min # add the current min mindate to the column
h = datetime.datetime.now()
print('loop took '+str((h-g).seconds) + ' seconds')
date = [d.date() for d in temp_df['date']] # convert Timestamps to datetime.date so np.busday_count() accepts them
newmatdate = [d.date() for d in temp_df['newmatdate']]
temp_df['BusinessDaysToNextMat'] = np.busday_count(date,newmatdate) # phew
Also, this is just for a single code; I will then loop it over however many codes there are.
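For what it's worth, the sorted-list bookkeeping above can be swapped for a min-heap (heapq), which avoids re-sorting on every removal. A hypothetical sketch, assuming temp_df's 'date' and 'matdate' are datetime columns:

import heapq
import pandas as pd

heap, next_mat = [], []
for today, mat in zip(temp_df['date'], temp_df['matdate']):
    if pd.notna(mat):
        heapq.heappush(heap, mat)      # record any newly seen maturity date
    while heap and heap[0] < today:    # discard maturities already in the past
        heapq.heappop(heap)
    next_mat.append(heap[0] if heap else pd.NaT)
temp_df['newmatdate'] = next_mat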
I want to make a historical dataframe with values from time series dataframe.
Today, I have df1 as below:
df1:
A B C
0 1.0 2.0 3.0
Tomorrow, I will have df1 as below:
df1:
A B C
0 1.5 2.6 3.7
So the output I want tomorrow is as below:
df2:
A B C
0 1.0 2.0 3.0
1 1.5 2.6 3.7
I just want to keep add each day's new value from df1 to a new dataframe df2 so that I can make a historical dataframe with daily values. Can you help me on this? Thank you.
From my understanding, you've got a source that updates once every day, which you load into df1. You'd then like to add that df1 to a df2 that stores all the values you've seen in df1 so far.
I'm basing my suggestion on a df1 with the same structure as yours, but with random values. Every time you run this code, it will append those values to a text file df2.txt stored in the folder c:\timeseries.
Here we go:
Add a folder C:/timeseries/ to your system. Then add an empty .txt file, enter the header line dates,A,B,C, and save it as df2.txt.
The following snippet takes the length of that text file and uses it to build a daily index that mimics your situation. That index becomes the date for your df1, which is otherwise filled with random numbers every time the snippet is run. Each time the snippet runs, the data from df1 is appended to df2.
So, run this snippet once...
# imports
import datetime
import os
import pandas as pd
import numpy as np
os.chdir('C:/timeseries/')
# creates df1 with random numbers
df1 = pd.DataFrame(np.random.randint(0,10,size=(1, 3)), columns=list('ABC'))
# Read your historic values (will be empty the first time you run it)
df2 = pd.read_csv('df2.txt', sep=",")
df2 = df2.set_index(['dates'])
# To mimic your real-life situation, I'm adding a timeseries with a datestamp
# that starts where df2 ends. If df2 is empty, it starts from 2018-01-01.
# Make a dummy date list (day number = rows stored so far + 1)
datelist = pd.date_range(datetime.datetime(2018, 1, len(df2) + 1).strftime('%Y-%m-%d'), periods=1).tolist()
df1['dates'] = datelist
df1 = df1.set_index(['dates'])
df1.index = pd.to_datetime(df1.index)
df2 = df2.append(df1)
df2.to_csv('df2.txt')
print(df2)
... to get this output:
A B C
dates
2018-01-01 00:00:00 8.0 6.0 8.0
Those are the current values of df1 and df2 at the time being. I'm not using a random seed here, so your data will differ from mine.
Run it ten times in a row and you'll get this:
A B C
dates
2018-01-01 00:00:00 8.0 6.0 8.0
2018-01-02 00:00:00 9.0 1.0 0.0
2018-01-03 00:00:00 3.0 1.0 3.0
2018-01-04 00:00:00 4.0 7.0 6.0
2018-01-05 00:00:00 1.0 4.0 3.0
2018-01-06 00:00:00 3.0 7.0 6.0
2018-01-07 00:00:00 8.0 6.0 4.0
2018-01-08 00:00:00 4.0 7.0 0.0
2018-01-09 00:00:00 0.0 9.0 8.0
2018-01-10 00:00:00 8.0 4.0 8.0
In order to start from scratch, go ahead and delete all rows but the header line in your df2.txt file.
I hope this is what you're looking for. If not, let me know.
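As a side note, to_csv can also append directly, which avoids re-reading the whole history on every run (an alternative under the same df2.txt layout):

# append today's row to the existing file instead of rewriting it
df1.to_csv('df2.txt', mode='a', header=False)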
Use pd.concat:
df1 = pd.concat([df1, df2])
or pd.DataFrame.append (deprecated in pandas 1.4 and removed in 2.0):
df1 = df1.append(df2)
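For the day-by-day accumulation described in the question, the concat version with the frames swapped and ignore_index=True keeps a clean running row index:

# append today's df1 to the historical df2, renumbering the rows 0, 1, 2, ...
df2 = pd.concat([df2, df1], ignore_index=True)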