how to group all the data as fast as possible? - python

I have 4188006 rows of data. I want to group my data by its Code column value, setting each Code value as the key and the corresponding rows as the value in a dict.
_a_stock_basic_data is my data:
            Code   date_time       open       high        low      close        amt    volume
0      000001.SZ  2007-03-01  19.000000  19.000000  18.100000  18.100000  418404992  22927500
1      000002.SZ  2007-03-01  14.770000  14.800000  13.860000  14.010000  659624000  46246800
2      000004.SZ  2007-03-01   6.000000   6.040000   5.810000   6.040000   23085800   3853070
3      000005.SZ  2007-03-01   4.200000   4.280000   4.000000   4.040000  131162000  31942000
4      000006.SZ  2007-03-01  13.050000  13.470000  12.910000  13.110000  251946000  19093500
...          ...         ...        ...        ...        ...        ...        ...       ...
88002  603989.SH  2015-06-30  44.950001  50.250000  41.520000  49.160000  314528000   6933840
88003  603993.SH  2015-06-30  10.930000  12.500000  10.540000  12.360000  532364992  46215300
88004  603997.SH  2015-06-30  21.400000  24.959999  20.549999  24.790001  169784992   7503370
88005  603998.SH  2015-06-30  65.110001  65.110001  65.110001  65.110001          0         0

[4188006 rows x 8 columns]
And my code is:
_a_stock_basic_data = pandas.concat(dfs)
_all_universe = set(all_universe.values.tolist())
for _code in _all_universe:
    _temp_data = _a_stock_basic_data[_a_stock_basic_data['Code'] == _code]
    data[_code] = _temp_data[_temp_data.notnull()]
_all_universe contains the values of _a_stock_basic_data['Code']. Its length is about 2816, so the for loop runs about 2816 times and takes a long time to complete.
So I wonder whether there is a higher-performance way to group this data. I think multiprocessing is an option, but shared memory seems to be a problem with it. And as the data grows larger and larger, the performance of this code needs to be taken into consideration, otherwise it will cost a lot. Thank you for your help.

I'll show an example which I think will solve your problem. Below I make a dataframe with random elements, where the column Code has duplicate values:
import numpy as np
import pandas as pd

a = pd.DataFrame({'a': np.arange(20),
                  'b': np.random.random(20),
                  'Code': np.random.randint(0, 11, 20)})
To group by the column Code, set it as index:
a.index = a['Code']
You can now use the index to access the data by the value of Code:
In : a.loc[8]
Out:
a b Code
Code
8 1 0.589938 8
8 3 0.030435 8
8 13 0.228775 8
8 14 0.329637 8
8 17 0.915402 8
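If the end goal is a dict mapping each Code to its own sub-DataFrame, a single groupby pass should also be much faster than 2816 boolean filters in a loop. A minimal sketch, assuming _a_stock_basic_data is the concatenated frame from the question and data is a plain dict:

# one pass over all 4188006 rows; a groupby object iterates as (key, sub-frame) pairs
data = {code: group for code, group in _a_stock_basic_data.groupby('Code')}

Each value is the slice of rows for that Code, which is what the original loop was building.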

Did you try the pd.concat function? It lets you concatenate arrays along an axis of your choice.
pd.concat([data,_temp_data],axis=1)

- dict(_a_stock_basic_data.groupby(['Code']).size())  ## Number of occurrences per code
- dict(_a_stock_basic_data.groupby(['Code'])['Column_you_want_to_Aggregate'].sum())  ## If you want to do an aggregation on a certain column

Related

Historical Volatility from Prices of many different bonds in same column

I have a csv file with bid/ask prices of many bonds (identified by ISIN) for the past year. Using these historical prices, I'm trying to calculate the historical volatility for each bond. Although this should typically be an easy task, the issue is that not all bonds have exactly the same number of days of trading price data, while they are all in the same column rather than stacked side by side. Hence, if I need to calculate a rolling standard deviation, I can't choose a standard rolling window of 252 days for one year.
The data set has this format:
BusinessDate  ISIN   Bid  Ask
Date 1        ISIN1  P1   P2
Date 2        ISIN1  P1   P2
...
Date 252      ISIN1  P1   P2
Date 1        ISIN2  P1   P2
Date 2        ISIN2  P1   P2
...
& so on.
My current code is as follows-
import numpy as np
import pandas as pd

vol_df = pd.read_csv('hist_prices.csv')
vol_df['BusinessDate'] = pd.to_datetime(vol_df['BusinessDate'])
vol_df['Mid Price'] = vol_df[['Bid', 'Ask']].mean(axis = 1)
vol_df['log_return'] = vol_df.groupby('ISIN')['Mid Price'].apply(lambda x: np.log(x) - np.log(x.shift(1)))
vol_df['hist_vol'] = vol_df['log_return'].std() * np.sqrt(252)
The last line of code seems to give all NaN values in the column. This is most likely because the standard deviation is being calculated on a single row rather than on a list of numbers. I tried replacing the last line with a rolling std:
vol_df.set_index('BusinessDate').groupby('ISIN').rolling(window = 1, freq = 'A').std()['log_return']
But this doesn't help either; it gives 2 numbers for each ISIN. I also tried to use pivot() to place the ISINs in columns and BusinessDate as the index, with the prices as values, but it gives an error. Also, I have close to 9,000 different ISINs, so putting them in columns to calculate std() for each column may not be the best way. Any clues on how I can sort this out?
I was able to resolve this in a crude way-
vol_df_2 = vol_df.groupby('ISIN')['logret'].std()
vol_df_3 = vol_df_2.to_frame()
vol_df_3.rename(columns={'logret': 'daily_std'}, inplace=True)
The first line above returns a series with the standard deviation column named 'logret'. So the 2nd and 3rd lines convert it into a dataframe and rename the daily standard deviation column accordingly. Finally, the annual vol can be calculated using sqrt(252).
If anyone has a better way to do it in the same dataframe instead of creating a series, that'd be great.
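One way to stay inside the same dataframe rather than building a separate series is groupby().transform, which broadcasts the per-ISIN standard deviation back onto every row. A minimal sketch, assuming the log-return column is named 'log_return' as in the earlier code:

# per-ISIN daily std of log returns, aligned back onto the original rows
vol_df['daily_std'] = vol_df.groupby('ISIN')['log_return'].transform('std')
# annualise with the usual sqrt(252) convention
vol_df['hist_vol'] = vol_df['daily_std'] * np.sqrt(252)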
OK, this almost works now.
It does need some math per ISIN to figure out the rolling period; I just used 3 and 2 in my example. You probably need to count how many days of trading there are in the year, or whatever, and fix it at that per ISIN somehow.
And then you need to figure out how to merge the data back. The output actually has errors because it's updating a copy, but that is kind of what I was looking for here. I am sure someone who knows more could fix it at this point. I can't get the merge working.
import numpy as np
import pandas as pd

toy_data={'BusinessDate': ['10/5/2020','10/6/2020','10/7/2020','10/8/2020','10/9/2020',
'10/12/2020','10/13/2020','10/14/2020','10/15/2020','10/16/2020',
'10/5/2020','10/6/2020','10/7/2020','10/8/2020'],
'ISIN': [1,1,1,1,1, 1,1,1,1,1, 2,2,2,2],
'Bid': [0.295,0.295,0.295,0.295,0.295,
0.296, 0.296,0.297,0.298,0.3,
2.5,2.6,2.71,2.8],
'Ask': [0.301,0.305,0.306,0.307,0.308,
0.315,0.326,0.337,0.348,0.37,
2.8,2.7,2.77,2.82]}
#vol_df = pd.read_csv('hist_prices.csv')
vol_df = pd.DataFrame(toy_data)
vol_df['BusinessDate'] = pd.to_datetime(vol_df['BusinessDate'])
vol_df['Mid Price'] = vol_df[['Bid', 'Ask']].mean(axis = 1)
vol_df['log_return'] = vol_df.groupby('ISIN')['Mid Price'].apply(lambda x: np.log(x) - np.log(x.shift(1)))
vol_df.dropna(subset = ['log_return'], inplace=True)
# do some math here to calculate how many days you want to roll for an ISIN
# maybe count how many days over a 1 year period exist???
# not really sure how you'd miss days unless stuff just doesnt trade
# (but I don't need to understand it anyway)
rolling = {1: 3, 2: 2}
for isin in vol_df['ISIN'].unique():
    roll = rolling[isin]
    print(f'isin={isin}, roll={roll}')
    df_single = vol_df[vol_df['ISIN']==isin]
    df_single['rolling'] = df_single['log_return'].rolling(roll).std()
    # i can't get the right syntax to merge data back, but this shows it
    vol_df[isin, 'rolling'] = df_single['rolling']
    print(df_single)
print(vol_df)
which outputs (minus the warning errors):
isin=1, roll=3
BusinessDate ISIN Bid Ask Mid Price log_return rolling
1 2020-10-06 1 0.295 0.305 0.3000 0.006689 NaN
2 2020-10-07 1 0.295 0.306 0.3005 0.001665 NaN
3 2020-10-08 1 0.295 0.307 0.3010 0.001663 0.002901
4 2020-10-09 1 0.295 0.308 0.3015 0.001660 0.000003
5 2020-10-12 1 0.296 0.315 0.3055 0.013180 0.006650
6 2020-10-13 1 0.296 0.326 0.3110 0.017843 0.008330
7 2020-10-14 1 0.297 0.337 0.3170 0.019109 0.003123
8 2020-10-15 1 0.298 0.348 0.3230 0.018751 0.000652
9 2020-10-16 1 0.300 0.370 0.3350 0.036478 0.010133
isin=2, roll=2
BusinessDate ISIN Bid ... log_return (1, rolling) rolling
11 2020-10-06 2 2.60 ... 2.220446e-16 NaN NaN
12 2020-10-07 2 2.71 ... 3.339828e-02 NaN 0.023616
13 2020-10-08 2 2.80 ... 2.522656e-02 NaN 0.005778
[3 rows x 8 columns]
BusinessDate ISIN Bid ... log_return (1, rolling) (2, rolling)
1 2020-10-06 1 0.295 ... 6.688988e-03 NaN NaN
2 2020-10-07 1 0.295 ... 1.665279e-03 NaN NaN
3 2020-10-08 1 0.295 ... 1.662511e-03 0.002901 NaN
4 2020-10-09 1 0.295 ... 1.659751e-03 0.000003 NaN
5 2020-10-12 1 0.296 ... 1.317976e-02 0.006650 NaN
6 2020-10-13 1 0.296 ... 1.784313e-02 0.008330 NaN
7 2020-10-14 1 0.297 ... 1.910886e-02 0.003123 NaN
8 2020-10-15 1 0.298 ... 1.875055e-02 0.000652 NaN
9 2020-10-16 1 0.300 ... 3.647821e-02 0.010133 NaN
11 2020-10-06 2 2.600 ... 2.220446e-16 NaN NaN
12 2020-10-07 2 2.710 ... 3.339828e-02 NaN 0.023616
13 2020-10-08 2 2.800 ... 2.522656e-02 NaN 0.005778
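For the merge-back step that the code above leaves open, one option (a sketch, not tested against the original data) is to write through .loc with a boolean mask, which assigns into the original frame instead of a copy and avoids the stray (1, 'rolling') tuple columns:

vol_df['rolling'] = np.nan
for isin, roll in rolling.items():
    mask = vol_df['ISIN'] == isin
    # compute the per-ISIN rolling std and write it back onto the same rows
    vol_df.loc[mask, 'rolling'] = vol_df.loc[mask, 'log_return'].rolling(roll).std()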

Problem with loop to calculate IRR function in python

I have a problem with calculating a function in Python. I want to calculate the IRR for a number of investments, each of which is described in its own dataframe. For every investment I have one dataframe per cut-off date, describing the flows of payments the investment has made up until that date, with the last row of each dataframe containing the stock of capital held at that point. I do this in order to build a time series of the IRR for each investment. The dataframes whose IRR I want to calculate are stored in a list.
To calculate the IRR for each dataframe I made these functions:
import numpy as np
from scipy.optimize import fsolve

def npv(irr, cfs, yrs):
    return np.sum(cfs / ((1. + irr) ** yrs))

def irr(cfs, yrs, x0):
    return np.asscalar(fsolve(npv, x0=x0, args=(cfs, yrs)))
So in order to calculate the IRR for each dataframe in my list I did:
for i, new_df in enumerate(dfs):
    cash_flow = new_df.FLOWS.values
    years = new_df.timediff.values
    output.loc[i, ['DATE']] = new_df['DATE'].iloc[-1]
    output.loc[i, ['Investment']] = new_df['Investment'].iloc[-1]
    output.loc[i, ['irr']] = irr(cash_flow, years, x0=0.)
Output is the dataframe I want to create that contains the information I want, i.e. the IRR of each investment up until a certain date. The problem is that it calculates the IRR correctly for some dataframes but not for others. For example, it calculates the IRR correctly for this dataframe:
DATE INVESTMENT FLOWS timediff
0 2014-02-24 1 -36278400.0 0.0
1 2014-03-25 1 -11490744.0 0.07945205479452055
2 2015-01-22 1 -13244300.0 0.9095890410958904
3 2015-09-24 1 -10811412.0 1.5808219178082192
4 2015-11-12 1 -6208238.0 1.715068493150685
5 2016-01-22 1 -6210161.0 1.9095890410958904
6 2016-03-31 1 -4535569.0 2.0986301369863014
7 2016-05-25 1 8420470.0 2.249315068493151
8 2016-06-30 1 12357138.0 2.347945205479452
9 2016-07-14 1 3498535.0 2.3863013698630136
10 2016-12-26 1 4085285.0 2.8383561643835615
11 2017-06-07 1 3056835.0 3.2849315068493152
12 2017-09-11 1 11254424.0 3.547945205479452
13 2017-11-16 1 9274834.0 3.728767123287671
14 2018-02-22 1 1622857.0 3.9972602739726026
15 2018-05-23 1 2642985.0 4.243835616438356
18 2018-08-23 1 9265099.0 4.495890410958904
16 2018-11-29 1 1011915.0 4.764383561643836
19 2018-12-28 1 1760734.0 4.843835616438356
17 2019-01-14 1 1940112.0 4.890410958904109
20 2019-06-30 1 116957227.3 5.347945205479452
This gives an IRR of 0.215. But for this dataframe, for the exact same investment, it does not: it returns an IRR of 0.0001, while the real IRR should be around 0.216.
DATE INVESTMENT FLOWS timediff
0 2014-02-24 1 -36278400.0 0.0
1 2014-03-25 1 -11490744.0 0.07945205479452055
2 2015-01-22 1 -13244300.0 0.9095890410958904
3 2015-09-24 1 -10811412.0 1.5808219178082192
4 2015-11-12 1 -6208238.0 1.715068493150685
5 2016-01-22 1 -6210161.0 1.9095890410958904
6 2016-03-31 1 -4535569.0 2.0986301369863014
7 2016-05-25 1 8420470.0 2.249315068493151
8 2016-06-30 1 12357138.0 2.347945205479452
9 2016-07-14 1 3498535.0 2.3863013698630136
10 2016-12-26 1 4085285.0 2.8383561643835615
11 2017-06-07 1 3056835.0 3.2849315068493152
12 2017-09-11 1 11254424.0 3.547945205479452
13 2017-11-16 1 9274834.0 3.728767123287671
14 2018-02-22 1 1622857.0 3.9972602739726026
15 2018-05-23 1 2642985.0 4.243835616438356
18 2018-08-23 1 9265099.0 4.495890410958904
16 2018-11-29 1 1011915.0 4.764383561643836
19 2018-12-28 1 1760734.0 4.843835616438356
17 2019-01-14 1 1940112.0 4.890410958904109
20 2019-09-30 1 123753575.7 5.6
These two dataframes have exactly the same flows except for the last row, which contains the stock of capital up until that date for that investment. So the only difference between these two dataframes is the last row, which means this investment hasn't had any inflows or outflows during that time. I don't understand why the IRR varies so much, or why some IRRs are calculated incorrectly.
Most are calculated correctly, but a few are not.
Thanks for helping me.
As I thought, it is a problem with the optimization method.
When I tried your irr function with the second df, I even received a warning:
RuntimeWarning: The iteration is not making good progress, as measured by the
improvement from the last ten iterations.
warnings.warn(msg, RuntimeWarning)
But trying scipy.optimize.root with other methods seems to work for me. I changed the function to:
import scipy.optimize as optimize
def irr(cfs, yrs, x0):
    r = optimize.root(npv, args=(cfs, yrs), x0=x0, method='broyden1')
    return float(r.x)
I just checked lm and broyden1, and both converged to around 0.216 on your second example. There are multiple methods, and I have no clue which would be the best choice among them, but most seem to do better than the hybr method used by fsolve.
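Another option worth trying (my own sketch, not part of the original answer) is a bracketing solver such as scipy.optimize.brentq, which tends to be very robust for IRR-style problems as long as the NPV changes sign over the chosen bracket:

from scipy.optimize import brentq

def irr_bracketed(cfs, yrs, lo=-0.99, hi=10.0):
    # assumes npv(lo, ...) and npv(hi, ...) have opposite signs, which holds for
    # conventional cash flows (outflows first, inflows later); npv is the function
    # defined in the question
    return brentq(npv, lo, hi, args=(cfs, yrs))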

Pandas create column with names of columns with lowest match

I have a Pandas dataframe with points and the corresponding distances to other points. I am able to get the minimal value of the calculated columns; however, I need the column name itself. I can't figure out how to get, in a new column, the column name that corresponds to that minimum. My dataframe looks like this:
df.head():
0 1 2 ... 6 7 min
9 58.0 94.0 984.003636 ... 696.667367 218.039561 218.039561
71 100.0 381.0 925.324708 ... 647.707783 169.856557 169.856557
61 225.0 69.0 751.353014 ... 515.152768 122.377490 122.377490
0 and 1 are datapoints, the rest are distances to datapoints #1 to 7; in some cases the number of points can differ, which does not really matter for the question. The code I use to compute the min is the following:
new = users.iloc[:,2:].min(axis=1)
users["min"] = new
#could also do the following way
#users.assign(Min=lambda users: users.iloc[:,2:].min(1))
This is quite simple, and there is not much to finding the minimum of multiple columns. However, I need the column name instead of the value. So my desired output would look like this (in the example they are all 7, which is not a rule):
0 1 2 ... 6 7 min
9 58.0 94.0 984.003636 ... 696.667367 218.039561 7
71 100.0 381.0 925.324708 ... 647.707783 169.856557 7
61 225.0 69.0 751.353014 ... 515.152768 122.377490 7
Is there a simple way to achieve this?
Use df.idxmin:
In [549]: df['min'] = df.iloc[:,2:].idxmin(axis=1)
In [550]: df
Out[550]:
0 1 2 6 7 min
9 58.0 94.0 984.003636 696.667367 218.039561 7
71 100.0 381.0 925.324708 647.707783 169.856557 7
61 225.0 69.0 751.353014 515.152768 122.377490 7
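A self-contained toy example (made-up numbers, purely to show the idiom), in case the frame above is hard to reproduce:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((3, 5)), columns=[0, 1, 2, 6, 7])
df['min'] = df.iloc[:, 2:].idxmin(axis=1)   # label of the column holding the smallest distance
print(df)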

Next day or next row index in Pandas Data frame

I am trying to find a way to get the next day (the next row in this case) in a Pandas dataframe. I thought this would be easy to find, but I'm struggling.
Starting Data:
ts = pd.DataFrame(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts.columns = ['Val']
ts['Week'] = ts.index.week
ts
Val Week
2000-01-01 -0.639345 52
2000-01-02 1.294537 52
2000-01-03 1.181486 1
2000-01-04 -0.011694 1
2000-01-05 -0.224887 1
2000-01-06 -0.493120 1
2000-01-07 1.439436 1
2000-01-08 1.017722 1
2000-01-09 1.125153 1
2000-01-10 0.209741 2
Subset of the data:
tsSig = ts[ts.Val>1.5].drop_duplicates(subset='Week')
tsSig.head()
Val Week
2000-01-24 2.215559 4
2000-02-09 1.561941 6
2000-02-24 1.645916 8
2000-03-16 1.745079 11
2000-04-10 1.570023 15
I now want to use the index from my tsSig subset to find the next day in ts, and then create a new column ts['Val_Dayplus1'] showing the values from the 25th (-0.309811), the 10th (-1.644814), etc.
I am trying things like ts.loc[tsSig.index].shift(1) to get the next day, but this is obviously not correct....
Desired output:
Val Val_Dayplus1 Week
2000-01-24 2.215559 -0.309811 4
2000-02-09 1.561941 -1.644814 6
2000-02-24 1.645916 -0.187440 8
(for all rows in tsSig.index)
Edit:
This appears to give me what I need in terms of shifting the date index on tsSig.index. I would like to hear of any other ways to do this as well.
ts.loc[tsSig.index + pd.DateOffset(days=1)]
tsSig['Val_Dayplus1'] = ts['Val'].loc[tsSig.index + pd.DateOffset(days=1)].values
I managed to work this one out, so I'm sharing the answer:
ts.loc[tsSig.index + pd.DateOffset(days=1)]
tsSig['Val_Dayplus1'] = ts['Val'].loc[tsSig.index + pd.DateOffset(days=1)].values
tsSig
Val Week Val_Dayplus1
2000-02-15 1.551125 7 -0.102154
2000-02-24 1.525402 8 -0.009776
2000-03-11 1.801845 10 0.832837
2000-03-22 1.546953 12 0.377510
2000-04-17 1.568720 16 -0.258558
2000-06-04 1.646147 22 0.853044
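A variation that avoids indexing with a shifted date (which raises a KeyError if the following calendar day is missing from ts) is to shift the whole Val column by one row first and then take the tsSig rows. A minimal sketch, assuming consecutive rows really are consecutive days as in the example data:

ts['Val_Dayplus1'] = ts['Val'].shift(-1)   # next row's value placed on the current row
tsSig = ts[ts.Val > 1.5].drop_duplicates(subset='Week')
print(tsSig[['Val', 'Val_Dayplus1', 'Week']])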
I'm not sure if I completely understand your question, but in general you can index into a pandas dataframe with df.iloc[ROWS, COLS]. So, in your case, for an index i in a for loop, you could use ts.iloc[i+1, :] to get all of the info from the next row of the ts dataframe.

Finding the percent change of values in a Series

I have a DataFrame with 2 columns. I need to know at what point the number of questions has increased.
In [19]: status
Out[19]:
seconds questions
0 751479 9005591
1 751539 9207129
2 751599 9208994
3 751659 9210429
4 751719 9211944
5 751779 9213287
6 751839 9214916
7 751899 9215924
8 751959 9216676
9 752019 9217533
I need the percent change of the 'questions' column and then to sort on it. This does not work:
status.pct_change('questions').sort('questions').head()
Any suggestions?
Try this way instead:
>>> status['change'] = status.questions.pct_change()
>>> status.sort_values('change', ascending=False)
questions seconds change
0 9005591 751479 NaN
1 9207129 751539 0.022379
2 9208994 751599 0.000203
6 9214916 751839 0.000177
4 9211944 751719 0.000164
3 9210429 751659 0.000156
5 9213287 751779 0.000146
7 9215924 751899 0.000109
9 9217533 752019 0.000093
8 9216676 751959 0.000082
pct_change can be performed on Series as well as DataFrames and accepts an integer argument for the number of periods you want to calculate the change over (the default is 1).
I've also assumed that you want to sort on the 'change' column with the greatest percentage changes showing first...
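As a small illustration of that periods argument (a sketch using the same status frame), the change can also be measured over two rows instead of one:

# percent change measured against the value two rows back
status['change_2'] = status['questions'].pct_change(periods=2)
status.sort_values('change_2', ascending=False).head()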
