Time difference: incorrect values of seconds - python

I am trying to compute the time difference between every two consecutive rows, grouping the data by id. The data look like:
id date
11 2021-02-04 10:34:46+03:00
11 2021-02-07 14:58:24+03:00
11 2021-02-07 19:23:28+03:00
11 2021-02-08 10:21:44+03:00
11 2021-02-09 11:36:09+03:00
I use this:
df['time_diff'] = df.groupby('id')['date'].diff().dt.seconds.div(60).fillna(0)
I've noticed that my result is incorrect.
When I use just
df.groupby('id')['date'].diff()
I get this, and it's correct:
70225 NaT
72324 3 days 04:23:38
72367 0 days 04:25:04
72515 0 days 14:58:16
73343 1 days 01:14:25
...
But when I try to convert it into seconds
df.groupby('id')['date'].diff().dt.seconds
I get
70225 NaN
72324 15818.0
72367 15904.0
72515 53896.0
73343 4465.0
...
Why might this happen?

It's very difficult to answer this without a reproducible example or an understanding of your desired behavior. The likely culprit is that .dt.seconds returns only the seconds component of each timedelta (a value between 0 and 86399) and discards the days. I suspect that what you want is pd.Series.dt.total_seconds(), which converts the whole duration:
df.groupby('id')['date'].diff().dt.total_seconds()
If that doesn't work, you could try something like:
df.groupby('id')['date'].diff() / pd.Timedelta(seconds=1)
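To make the difference concrete, here is a minimal sketch using two made-up timestamps in the same format as the question (not the real data):

import pandas as pd

s = pd.Series(pd.to_datetime([
    '2021-02-04 10:34:46+03:00',
    '2021-02-07 14:58:24+03:00',
]))
delta = s.diff()                                   # NaT, then 3 days 04:23:38
print(delta.dt.seconds)                            # NaN, 15818.0  -> only the 04:23:38 part, days dropped
print(delta.dt.total_seconds())                    # NaN, 275018.0 -> the whole 3 days 04:23:38 in seconds
print(delta.dt.total_seconds().div(60).fillna(0))  # minutes, as in the original code

The 15818 in the question's output is exactly the 04:23:38 component of the 3-day gap, which is why the values looked plausible but were wrong.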

Related

Problem with loop to calculate IRR function in python

I have a problem calculating a function in Python. I want to calculate the IRR for a number of investments, each described by its own dataframe. For every investment I have a dataframe of the payment flows it has made up to a certain date, with the last row containing the stock of capital the investment holds at that point. Different cut-off dates give me several dataframes per investment, so I effectively get a time series of the IRR for each investment. All the dataframes whose IRR I want to calculate are stored in a list.
To calculate the IRR for each dataframe I wrote these functions:
import numpy as np
from scipy.optimize import fsolve

def npv(irr, cfs, yrs):
    return np.sum(cfs / ((1. + irr) ** yrs))

def irr(cfs, yrs, x0):
    return np.asscalar(fsolve(npv, x0=x0, args=(cfs, yrs)))
So in order to calculate the IRR for each dataframe in my list I did:
for i, new_df in enumerate(dfs):
    cash_flow = new_df.FLOWS.values
    years = new_df.timediff.values
    output.loc[i, ['DATE']] = new_df['DATE'].iloc[-1]
    output.loc[i, ['Investment']] = new_df['Investment'].iloc[-1]
    output.loc[i, ['irr']] = irr(cash_flow, years, x0=0.)
Output is the dataframe I want to create, containing the information I need, i.e. the IRR of each investment up until a certain date. The problem is that it calculates the IRR correctly for some dataframes but not for others. For example, it calculates the IRR correctly for this dataframe:
DATE INVESTMENT FLOWS timediff
0 2014-02-24 1 -36278400.0 0.0
1 2014-03-25 1 -11490744.0 0.07945205479452055
2 2015-01-22 1 -13244300.0 0.9095890410958904
3 2015-09-24 1 -10811412.0 1.5808219178082192
4 2015-11-12 1 -6208238.0 1.715068493150685
5 2016-01-22 1 -6210161.0 1.9095890410958904
6 2016-03-31 1 -4535569.0 2.0986301369863014
7 2016-05-25 1 8420470.0 2.249315068493151
8 2016-06-30 1 12357138.0 2.347945205479452
9 2016-07-14 1 3498535.0 2.3863013698630136
10 2016-12-26 1 4085285.0 2.8383561643835615
11 2017-06-07 1 3056835.0 3.2849315068493152
12 2017-09-11 1 11254424.0 3.547945205479452
13 2017-11-16 1 9274834.0 3.728767123287671
14 2018-02-22 1 1622857.0 3.9972602739726026
15 2018-05-23 1 2642985.0 4.243835616438356
18 2018-08-23 1 9265099.0 4.495890410958904
16 2018-11-29 1 1011915.0 4.764383561643836
19 2018-12-28 1 1760734.0 4.843835616438356
17 2019-01-14 1 1940112.0 4.890410958904109
20 2019-06-30 1 116957227.3 5.347945205479452
With an IRR of 0.215. But for this dataframe, for the exact same investment, it does not: it returns an IRR of 0.0001, while the real IRR should be around 0.216.
DATE INVESTMENT FLOWS timediff
0 2014-02-24 1 -36278400.0 0.0
1 2014-03-25 1 -11490744.0 0.07945205479452055
2 2015-01-22 1 -13244300.0 0.9095890410958904
3 2015-09-24 1 -10811412.0 1.5808219178082192
4 2015-11-12 1 -6208238.0 1.715068493150685
5 2016-01-22 1 -6210161.0 1.9095890410958904
6 2016-03-31 1 -4535569.0 2.0986301369863014
7 2016-05-25 1 8420470.0 2.249315068493151
8 2016-06-30 1 12357138.0 2.347945205479452
9 2016-07-14 1 3498535.0 2.3863013698630136
10 2016-12-26 1 4085285.0 2.8383561643835615
11 2017-06-07 1 3056835.0 3.2849315068493152
12 2017-09-11 1 11254424.0 3.547945205479452
13 2017-11-16 1 9274834.0 3.728767123287671
14 2018-02-22 1 1622857.0 3.9972602739726026
15 2018-05-23 1 2642985.0 4.243835616438356
18 2018-08-23 1 9265099.0 4.495890410958904
16 2018-11-29 1 1011915.0 4.764383561643836
19 2018-12-28 1 1760734.0 4.843835616438356
17 2019-01-14 1 1940112.0 4.890410958904109
20 2019-09-30 1 123753575.7 5.6
These two dataframes have exactly the same flows except for the last row, which contains the stock of capital of the investment up until that date. So the only difference between them is the last row, meaning the investment had no inflows or outflows during that time. I don't understand why the IRR varies so much, or why some IRRs are calculated incorrectly. Most are calculated correctly, but a few are not.
Thanks for helping me.
As I suspected, it is a problem with the optimization method.
When I tried your irr function with the second df, I even received a warning:
RuntimeWarning: The iteration is not making good progress, as measured by the
improvement from the last ten iterations.
warnings.warn(msg, RuntimeWarning)
But scipy.optimize.root with other methods seems to work for me. I changed the function to:
import scipy.optimize as optimize

def irr(cfs, yrs, x0):
    r = optimize.root(npv, args=(cfs, yrs), x0=x0, method='broyden1')
    return float(r.x)
I just checked lm and broyden1, and both converged to around 0.216 on your second example. There are several methods available, and I have no idea which would be the best choice among them, but most seem to do better than the hybr method used by fsolve.
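For reference, a rough sketch of how one might compare several solvers on the same npv function; the cash flows and years below are placeholders, not the question's data:

import numpy as np
from scipy import optimize

def npv(irr, cfs, yrs):
    # net present value of cash flows cfs occurring at times yrs, discounted at rate irr
    return np.sum(cfs / ((1. + irr) ** yrs))

# placeholder cash flows: one outflow followed by inflows (replace with your data)
cfs = np.array([-100.0, 30.0, 40.0, 50.0, 20.0])
yrs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

for method in ['hybr', 'lm', 'broyden1']:
    r = optimize.root(npv, x0=0.1, args=(cfs, yrs), method=method)
    print(method, r.success, float(np.atleast_1d(r.x)[0]))

Checking r.success alongside the root makes it easier to spot when a method has failed to converge, as hybr did on the second dataframe.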

Plotting counts of a dataframe grouped by timestamp

So I have a pandas dataframe which has a large number of columns, and one of the columns is a timestamp in datetime format. Each row in the dataframe represents a single "event". What I'm trying to do is graph the frequency of these events over time. Basically a simple bar graph showing how many events per month.
Started with this code:
data.groupby([(data.Timestamp.dt.year),(data.Timestamp.dt.month)]).count().plot(kind = 'bar')
plt.show()
This "kind of" works. But there are 2 problems:
1) The graph comes with a legend which includes all the columns in the original data (like 30+ columns). And each bar on the graph has a tiny sub-bar for each of the columns (all of which are the same value since I'm just counting events).
2) There are some months where there are zero events. And these months don't show up on the graph at all.
I finally came up with code to get the graph looking the way I wanted. But it seems to me that I'm not doing this the "correct" way, since this must be a fairly common use case.
Basically I created a new dataframe with one column "count" and an index that's a string representation of month/year. I populated that with zeroes over the time range I care about and then I copied over the data from the first frame into the new one. Here is the code:
import pandas as pd
import matplotlib.pyplot as plt

cnt = data.groupby([(data.Timestamp.dt.year), (data.Timestamp.dt.month)]).count()

index = []
for year in [2015, 2016, 2017, 2018]:
    for month in range(1, 13):
        index.append('%04d-%02d' % (year, month))

cnt_new = pd.DataFrame(index=index, columns=['count'])
cnt_new = cnt_new.fillna(0)
for i, row in cnt.iterrows():
    cnt_new.at['%04d-%02d' % i, 'count'] = row[0]

cnt_new.plot(kind='bar')
plt.show()
Anyone know an easier way to go about this?
EDIT --> Per request, here's an idea of the type of dataframe. It's the result of an SQL query. The actual data is my company's, so...
Timestamp FirstName LastName HairColor \
0 2018-11-30 02:16:11 Fred Schwartz brown
1 2018-11-29 16:25:55 Sam Smith black
2 2018-11-19 21:12:29 Helen Hunt red
OK, so I think I got it. Thanks to Yuca for the resample command. I just need to run it on the Timestamp series (rather than on the whole dataframe) and it gives me exactly what I was looking for.
data.index = data.Timestamp
data.Timestamp.resample('M').count()
Timestamp
2017-11-30 0
2017-12-31 0
2018-01-31 1
2018-02-28 2
2018-03-31 7
2018-04-30 9
2018-05-31 2
2018-06-30 6
2018-07-31 5
2018-08-31 4
2018-09-30 1
2018-10-31 0
2018-11-30 5
So the OP's request is: "Basically a simple bar graph showing how many events per month."
Using pd.resample with monthly frequency yields the desired result:
df[['FirstName']].resample('M').count()
Output:
FirstName
Timestamp
2018-11-30 3
To include non-observed months, we need to create a baseline calendar:
df_a = pd.DataFrame(index = pd.date_range(df.index[0].date(), periods=12, freq='M'))
and then assign to it the result of our resample
df_a['count'] = df[['FirstName']].resample('M').count()
Output:
count
2018-11-30 3.0
2018-12-31 NaN
2019-01-31 NaN
2019-02-28 NaN
2019-03-31 NaN
2019-04-30 NaN
2019-05-31 NaN
2019-06-30 NaN
2019-07-31 NaN
2019-08-31 NaN
2019-09-30 NaN
2019-10-31 NaN
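Putting the two answers together, here is a rough sketch (assuming a dataframe named data with a datetime column Timestamp, as in the example above) that counts events per month, fills in the empty months with zero, and plots a single bar series:

import pandas as pd
import matplotlib.pyplot as plt

# one count per month; size() avoids the per-column sub-bars produced by count()
monthly = data.set_index('Timestamp').resample('M').size()

# reindex against a full monthly calendar so months with zero events still appear
full_range = pd.date_range('2015-01-01', '2018-12-31', freq='M')
monthly = monthly.reindex(full_range, fill_value=0)

monthly.plot(kind='bar')
plt.show()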

time delta in pandas dataframe

I have a question about how to create a day-count type of column in pandas. Given a list of dates, I want to calculate the difference in days from one date to the previous date. I can do this with simple subtraction, and it returns a timedelta object, I think. But what if I just want an integer number of days? Using .days seems to work with two individual dates, but I can't get it to work with a column.
Let's say I do,
df['day_count'] = (df['INDEX_DATE'] - df['INDEX_DATE'].shift(1))
INDEX_DATE day_count
0 2009-10-06 NaT
1 2009-10-07 1 days
2 2009-10-08 1 days
3 2009-10-09 1 days
4 2009-10-12 3 days
5 2009-10-13 1 days
I get '1 days'.... I only want 1.
I can use .days like this, which does return a number, but it won't work on an entire column:
(df['INDEX_DATE'][1] - df['INDEX_DATE'][0]).days
If I try something like this:
df['day_count'] = (df['INDEX_DATE'] - df['INDEX_DATE'].shift(1)).days
I get an error of
AttributeError: 'Series' object has no attribute 'days'
I can work around '1 days' but I'm thinking there must be a better way to do this.
Try this:
In [197]: df['day_count'] = df.INDEX_DATE.diff().dt.days
In [198]: df
Out[198]:
INDEX_DATE day_count
0 2009-10-06 NaN
1 2009-10-07 1.0
2 2009-10-08 1.0
3 2009-10-09 1.0
4 2009-10-12 3.0
5 2009-10-13 1.0
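If the NaN in the first row is a nuisance, here is a small sketch of two follow-ups using only standard pandas: fill it and cast to plain integers, or keep the missing value with the nullable integer dtype.

import pandas as pd

df = pd.DataFrame({'INDEX_DATE': pd.to_datetime(
    ['2009-10-06', '2009-10-07', '2009-10-08', '2009-10-09', '2009-10-12', '2009-10-13'])})

day_count = df.INDEX_DATE.diff().dt.days
df['day_count'] = day_count.fillna(0).astype(int)     # first row becomes 0
df['day_count_nullable'] = day_count.astype('Int64')  # first row stays missing (<NA>)
print(df)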

how to group all the data as fast as possible?

I have 4188006 rows of data. I want to group my data by its Code column, setting the Code value as the key and the corresponding rows as the value in a dict.
_a_stock_basic_data is my data:
Code date_time open high low close \
0 000001.SZ 2007-03-01 19.000000 19.000000 18.100000 18.100000
1 000002.SZ 2007-03-01 14.770000 14.800000 13.860000 14.010000
2 000004.SZ 2007-03-01 6.000000 6.040000 5.810000 6.040000
3 000005.SZ 2007-03-01 4.200000 4.280000 4.000000 4.040000
4 000006.SZ 2007-03-01 13.050000 13.470000 12.910000 13.110000
... ... ... ... ... ... ...
88002 603989.SH 2015-06-30 44.950001 50.250000 41.520000 49.160000
88003 603993.SH 2015-06-30 10.930000 12.500000 10.540000 12.360000
88004 603997.SH 2015-06-30 21.400000 24.959999 20.549999 24.790001
88005 603998.SH 2015-06-30 65.110001 65.110001 65.110001 65.110001
amt volume
0 418404992 22927500
1 659624000 46246800
2 23085800 3853070
3 131162000 31942000
4 251946000 19093500
.... ....
88002 314528000 6933840
88003 532364992 46215300
88004 169784992 7503370
88005 0 0
[4188006 rows x 8 columns]
And my code is:
_a_stock_basic_data = pandas.concat(dfs)
_all_universe = set(all_universe.values.tolist())
for _code in _all_universe:
    _temp_data = _a_stock_basic_data[_a_stock_basic_data['Code']==_code]
    data[_code] = _temp_data[_temp_data.notnull()]
_all_universe contains the values of _a_stock_basic_data['Code']. Its length is about 2816, so the loop runs 2816 times, which takes a long time to complete.
So I wonder how to group this data with a high-performance method. I think multiprocessing is an option, but shared memory would be a problem. And as the data keeps growing, the performance of the code needs to be taken into consideration, otherwise it will cost a lot of time. Thank you for your help.
I'll show an example which I think will solve your problem. Below I make a dataframe with random elements, where the column Code has duplicate values:
a = pd.DataFrame({'a':np.arange(20), 'b':np.random.random(20), 'Code':np.random.random_integers(0, 10, 20)})
To group by the column Code, set it as index:
a.index = a['Code']
You can now use the index to access the data by the value of Code:
In : a.ix[8]
Out:
a b Code
Code
8 1 0.589938 8
8 3 0.030435 8
8 13 0.228775 8
8 14 0.329637 8
8 17 0.915402 8
Did you try the pd.concat function? It lets you append arrays along an axis of your choice.
pd.concat([data,_temp_data],axis=1)
- dict(_a_stock_basic_data.groupby(['Code']).size()) ## number of occurrences per code
- dict(_a_stock_basic_data.groupby(['Code'])['Column_you_want_to_Aggregate'].sum()) ## if you want to do an aggregation on a certain column
Would either of these help?
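Building on the groupby suggestions above, here is a minimal sketch (with a toy frame standing in for _a_stock_basic_data) of how a single groupby pass can build the whole dict, instead of filtering the full frame once per code:

import pandas as pd

# toy stand-in for _a_stock_basic_data: a few rows with duplicate codes
df = pd.DataFrame({
    'Code': ['000001.SZ', '000002.SZ', '000001.SZ', '000004.SZ'],
    'close': [18.10, 14.01, 18.50, 6.04],
})

# one pass over the data instead of 2816 boolean filters
data = {code: group for code, group in df.groupby('Code')}
print(data['000001.SZ'])

Each value is an ordinary DataFrame, so downstream code that looks investments up by their code should not need to change.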

Is this a Pandas bug with notnull() or a fundamental misunderstanding on my part (probably misunderstanding)

I have a pandas dataframe with two columns and default indexing. The first column is a string and the second is a date. The top date is NaN (though it should be NaT really).
index somestr date
0 ON NaN
1 1C 2014-06-11 00:00:00
2 2C 2014-07-09 00:00:00
3 3C 2014-08-13 00:00:00
4 4C 2014-09-10 00:00:00
5 5C 2014-10-08 00:00:00
6 6C 2014-11-12 00:00:00
7 7C 2014-12-10 00:00:00
8 8C 2015-01-14 00:00:00
9 9C 2015-02-11 00:00:00
10 10C 2015-03-11 00:00:00
11 11C 2015-04-08 00:00:00
12 12C 2015-05-13 00:00:00
Call this dataframe df.
When I run:
df[pd.notnull(df['date'])]
I expect the first row to go away. It doesn't.
If I remove the column with string by setting:
df=df[['date']]
Then apply:
df[pd.notnull(df['date'])]
then the first row with the null does go away.
Also, the row with the null always goes away if all columns are number/date types. When a column with a string appears, this problem occurs.
Surely this is a bug, right? I am not sure if others will be able to replicate this.
This was on my Enthought Canopy for Windows (I am not smart enough for UNIX/Linux command line noise)
Per requests below from Jeff and unutbu:
@unutbu -
df.dtypes
somestr object
date object
dtype: object
Also:
type(df.iloc[0]['date'])
pandas.tslib.NaTType
In the code this column was specifically assigned pd.NaT.
I also do not understand why it says NaN when it should say NaT. The filtering I used worked fine when I used this toy frame:
df=pd.DataFrame({'somestr' : ['aa', 'bb'], 'date' : [pd.NaT, dt.datetime(2014,4,15)]}, columns=['somestr', 'date'])
It should also be noted that although the table above shows NaN in the output, the following outputs NaT:
df['date'][0]
NaT
Also:
pd.notnull(df['date'][0])
False
pd.notnull(df['date'][1])
True
but....when evaluating the array, they all came back True - bizarre...
np.all(pd.notnull(df['date']))
True
@Jeff - this is 0.12, and I am stuck with it. The frame was created by concatenating two frames grabbed from database queries using psql. The date and some other float columns were then added by calculations I did. Of course, I filtered down to the two relevant columns here until I pinpointed that the string-valued columns were causing the problem.
************ How to Replicate **********
import pandas as pd
import datetime as dt
print(pd.__version__)
# 0.12.0
df = pd.DataFrame({'somestr': ['aa', 'bb'], 'date': ['cc', 'dd']},
columns=['somestr', 'date'])
df['date'].iloc[0] = pd.NaT
df['date'].iloc[1] = pd.to_datetime(dt.datetime(2014, 4, 15))
print(df[pd.notnull(df['date'])])
# somestr date
# 0 aa NaN
# 1 bb 2014-04-15 00:00:00
df2 = df[['date']]
print(df2[pd.notnull(df2['date'])])
# date
# 1 2014-04-15 00:00:00
So, this dataframe originally had all string entries - then the date column was converted to dates with an NaT at the top - note that in the table it is NaN, but when using df.iloc[0]['date'] you do see the NaT. Using the snippet above, you can see that the filtering by not null is bizarre with and without the somestr column. Again - this is Enthought Canopy for Windows with Pandas 0.12 and NumPy 1.8.
I encountered this problem too. Here's how I fixed it. isnull() checks whether something is NaN or empty, and the "~" (tilde) operator negates the following expression. So we are saying: give me the rows of the original dataframe where the 'date' column is NOT null.
df = df[~df['date'].isnull()]
Hope this helps!
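On current pandas versions, the usual way to sidestep the whole issue (a sketch of the idea, not a description of the 0.12 behaviour; the original filter may well already work on newer releases) is to convert the column to a real datetime64 dtype first. Once the dtype is datetime64, the missing entry shows as NaT and notnull()/isnull() filter it as expected even with string columns alongside:

import pandas as pd
import datetime as dt

# rebuild the object-dtype column from the "How to Replicate" snippet above
df = pd.DataFrame({'somestr': ['aa', 'bb'], 'date': ['cc', 'dd']},
                  columns=['somestr', 'date'])
df.loc[0, 'date'] = pd.NaT
df.loc[1, 'date'] = dt.datetime(2014, 4, 15)
print(df.dtypes)                    # 'date' is still object dtype here

df['date'] = pd.to_datetime(df['date'])   # force a proper datetime64 column
print(df.dtypes)                    # 'date' is now datetime64[ns]
print(df[df['date'].notnull()])     # only the 2014-04-15 row remains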
