I'm having some trouble sorting and then resetting my Index in Pandas:
dfm = dfm.sort(['delt'],ascending=False)
dfm = dfm.reindex(index=range(1,len(dfm)))
The dataframe returns unsorted after I reindex. My ultimate goal is to have a sorted dataframe with index numbers from 1 --> len(dfm) so if there's a better way to do that, I wouldn't mind,
Thanks!
Instead of reindexing, just change the actual index:
dfm.index = range(1,len(dfm) + 1)
Then that wont change the order, just the index
I think you're misunderstanding what reindex does. It uses the passed index to select values along the axis passed, then fills with NaN wherever your passed index doesn't match up with the current index. What you're interested in is just setting the index to something else:
In [12]: df = DataFrame(randn(10, 2), columns=['a', 'delt'])
In [13]: df
Out[13]:
a delt
0 0.222 -0.964
1 0.038 -0.367
2 0.293 1.349
3 0.604 -0.855
4 -0.455 -0.594
5 0.795 0.013
6 -0.080 -0.235
7 0.671 1.405
8 0.436 0.415
9 0.840 1.174
In [14]: df.reindex(index=arange(1, len(df) + 1))
Out[14]:
a delt
1 0.038 -0.367
2 0.293 1.349
3 0.604 -0.855
4 -0.455 -0.594
5 0.795 0.013
6 -0.080 -0.235
7 0.671 1.405
8 0.436 0.415
9 0.840 1.174
10 NaN NaN
In [16]: df.index = arange(1, len(df) + 1)
In [17]: df
Out[17]:
a delt
1 0.222 -0.964
2 0.038 -0.367
3 0.293 1.349
4 0.604 -0.855
5 -0.455 -0.594
6 0.795 0.013
7 -0.080 -0.235
8 0.671 1.405
9 0.436 0.415
10 0.840 1.174
Remember, if you want len(df) to be in the index you have to add 1 to the endpoint since Python doesn't include endpoints when constructing ranges.
Related
I have dataframe where one column looks like
Average Weight (Kg)
0.647
0.88
0
0.73
1.7 - 2.1
1.2 - 1.5
2.5
NaN
1.5 - 1.9
1.3 - 1.5
0.4
1.7 - 2.9
Reproducible data
df = pd.DataFrame([0.647,0.88,0,0.73,'1.7 - 2.1','1.2 - 1.5',2.5 ,np.NaN,'1.5 - 1.9','1.3 - 1.5',0.4,'1.7 - 2.9'],columns=['Average Weight (Kg)'])
where I would like to take average of range entries and replace it in the dataframe e.g. 1.7 - 2.1 will be replaced by 1.9 , following code doesn't work TypeError: 'float' object is not iterable
np.where(df['Average Weight (Kg)'].str.contains('-'), df['Average Weight (Kg)'].str.split('-')
.apply(lambda x: statistics.mean((list(map(float, x)) ))), df['Average Weight (Kg)'])
Another possible solution, which is based on the following ideas:
Convert column to string.
Split each cell by \s-\s.
Explode column.
Convert back to float.
Group by and mean.
df['Average Weight (Kg)'] = df['Average Weight (Kg)'].astype(
str).str.split(r'\s-\s').explode().astype(float).groupby(level=0).mean()
Output:
Average Weight (Kg)
0 0.647
1 0.880
2 0.000
3 0.730
4 1.900
5 1.350
6 2.500
7 NaN
8 1.700
9 1.400
10 0.400
11 2.300
edit: slight change to avoid creating a new column
You could go for something like this (renamed your column name to avg, cause it was long to type :-) ):
new_average =(df.avg.str.split('-').str[1].astype(float) + df.avg.str.split('-').str[0].astype(float) ) / 2
df["avg"] = new_average.fillna(df.avg)
yields for avg:
0 0.647
1 0.880
2 0.000
3 0.730
4 1.900
5 1.350
6 2.500
7 NaN
8 1.700
9 1.400
10 0.400
11 2.300
Name: avg2, dtype: float64
I have a csv file with bid/ask prices of many bonds (using ISIN identifiers) for the past 1 yr. Using these historical prices, I'm trying to calculate the historical volatility for each bond. Although it should be typically an easy task, the issue is not all bonds have exactly same number of days of trading price data, while they're all in same column and not stacked. Hence if I need to calculate a rolling std deviation, I can't choose a standard rolling window of 252 days for 1 yr.
The data set has this format-
BusinessDate
ISIN
Bid
Ask
Date 1
ISIN1
P1
P2
Date 2
ISIN1
P1
P2
Date 252
ISIN1
P1
P2
Date 1
ISIN2
P1
P2
Date 2
ISIN2
P1
P2
......
& so on.
My current code is as follows-
vol_df = pd.read_csv('hist_prices.csv')
vol_df['BusinessDate'] = pd.to_datetime(vol_df['BusinessDate'])
vol_df[Mid Price'] = vol_df[['Bid', 'Ask']].mean(axis = 1)
vol_df['log_return'] = vol_df.groupby('ISIN')['Mid Price'].apply(lambda x: np.log(x) - np.log(x.shift(1)))
vol_df['hist_vol'] = vol_df['log_return'].std() * np.sqrt(252)
The last line of code seems to be giving all NaN values in the column. This is most likely because the operation for calculating the std deviation is happening on the same row number and not for a list of numbers. I tried replacing the last line to use rolling_std-
vol_df.set_index('BusinessDate').groupby('ISIN').rolling(window = 1, freq = 'A').std()['log_return']
But this doesn't help either. It gives 2 numbers for each ISIN. I also tried to use pivot() to place the ISINs in columns and BusinessDate as index, and the Prices as "values". But it gives an error. Also I've close to 9,000 different ISINs and hence putting them in columns to calculate std() for each column may not be the best way. Any clues on how I can sort this out?
I was able to resolve this in a crude way-
vol_df_2 = vol_df.groupby('ISIN')['logret'].std()
vol_df_3 = vol_df_2.to_frame()
vol_df_3.rename(columns = {'logret':'daily_std}, inplace = True)
The first line above was returning a series and the std deviation column named as 'logret'. So the 2nd and 3rd line of code converts it into a dataframe and renames the daily std deviation as such. And finally the annual vol can be calculated using sqrt(252).
If anyone has a better way to do it in the same dataframe instead of creating a series, that'd be great.
ok this almost works now.
It does need some math per ISIN to figure out the rolling period, I just used 3 and 2 in my example, you probably need to count how many days of trading in the year or whatever and fix it at that per ISIN somehow.
And then you need to figure out how to merge the data back. The output actually has errors becuase its updating a copy, but that is kind of what I was looking for here. I am sure someone that knows more could fix it at this point. I can't get it working to do the merge.
toy_data={'BusinessDate': ['10/5/2020','10/6/2020','10/7/2020','10/8/2020','10/9/2020',
'10/12/2020','10/13/2020','10/14/2020','10/15/2020','10/16/2020',
'10/5/2020','10/6/2020','10/7/2020','10/8/2020'],
'ISIN': [1,1,1,1,1, 1,1,1,1,1, 2,2,2,2],
'Bid': [0.295,0.295,0.295,0.295,0.295,
0.296, 0.296,0.297,0.298,0.3,
2.5,2.6,2.71,2.8],
'Ask': [0.301,0.305,0.306,0.307,0.308,
0.315,0.326,0.337,0.348,0.37,
2.8,2.7,2.77,2.82]}
#vol_df = pd.read_csv('hist_prices.csv')
vol_df = pd.DataFrame(toy_data)
vol_df['BusinessDate'] = pd.to_datetime(vol_df['BusinessDate'])
vol_df['Mid Price'] = vol_df[['Bid', 'Ask']].mean(axis = 1)
vol_df['log_return'] = vol_df.groupby('ISIN')['Mid Price'].apply(lambda x: np.log(x) - np.log(x.shift(1)))
vol_df.dropna(subset = ['log_return'], inplace=True)
# do some math here to calculate how many days you want to roll for an ISIN
# maybe count how many days over a 1 year period exist???
# not really sure how you'd miss days unless stuff just doesnt trade
# (but I don't need to understand it anyway)
rolling = {1: 3, 2: 2}
for isin in vol_df['ISIN'].unique():
roll = rolling[isin]
print(f'isin={isin}, roll={roll}')
df_single = vol_df[vol_df['ISIN']==isin]
df_single['rolling'] = df_single['log_return'].rolling(roll).std()
# i can't get the right syntax to merge data back, but this shows it
vol_df[isin, 'rolling'] = df_single['rolling']
print(df_single)
print(vol_df)
which outputs (minus the warning errors):
isin=1, roll=3
BusinessDate ISIN Bid Ask Mid Price log_return rolling
1 2020-10-06 1 0.295 0.305 0.3000 0.006689 NaN
2 2020-10-07 1 0.295 0.306 0.3005 0.001665 NaN
3 2020-10-08 1 0.295 0.307 0.3010 0.001663 0.002901
4 2020-10-09 1 0.295 0.308 0.3015 0.001660 0.000003
5 2020-10-12 1 0.296 0.315 0.3055 0.013180 0.006650
6 2020-10-13 1 0.296 0.326 0.3110 0.017843 0.008330
7 2020-10-14 1 0.297 0.337 0.3170 0.019109 0.003123
8 2020-10-15 1 0.298 0.348 0.3230 0.018751 0.000652
9 2020-10-16 1 0.300 0.370 0.3350 0.036478 0.010133
isin=2, roll=2
BusinessDate ISIN Bid ... log_return (1, rolling) rolling
11 2020-10-06 2 2.60 ... 2.220446e-16 NaN NaN
12 2020-10-07 2 2.71 ... 3.339828e-02 NaN 0.023616
13 2020-10-08 2 2.80 ... 2.522656e-02 NaN 0.005778
[3 rows x 8 columns]
BusinessDate ISIN Bid ... log_return (1, rolling) (2, rolling)
1 2020-10-06 1 0.295 ... 6.688988e-03 NaN NaN
2 2020-10-07 1 0.295 ... 1.665279e-03 NaN NaN
3 2020-10-08 1 0.295 ... 1.662511e-03 0.002901 NaN
4 2020-10-09 1 0.295 ... 1.659751e-03 0.000003 NaN
5 2020-10-12 1 0.296 ... 1.317976e-02 0.006650 NaN
6 2020-10-13 1 0.296 ... 1.784313e-02 0.008330 NaN
7 2020-10-14 1 0.297 ... 1.910886e-02 0.003123 NaN
8 2020-10-15 1 0.298 ... 1.875055e-02 0.000652 NaN
9 2020-10-16 1 0.300 ... 3.647821e-02 0.010133 NaN
11 2020-10-06 2 2.600 ... 2.220446e-16 NaN NaN
12 2020-10-07 2 2.710 ... 3.339828e-02 NaN 0.023616
13 2020-10-08 2 2.800 ... 2.522656e-02 NaN 0.005778
I have a pandas dataframe consisting of 12 columns and 900 entries which looks like this:
In [1]: df
Out[2]:
Id BestInGen Ceiling Fitness Floor Generation Name Precision Runid SolutionId Timestamp Value
0 1 True 2.5 2.416582e+11 0.500 1 H1001Thickness1 0.010 20180214142319 4 2018-02-14 14:28:41.391908 0.500
1 2 False 0.1 2.830500e+11 0.015 1 H6512Diameter8 0.005 20180214142319 3 2018-02-14 14:28:41.423109 0.015
2 3 False 2.5 2.830500e+11 0.500 1 H2201Thickness1 0.010 20180214142319 3 2018-02-14 14:28:41.423109 0.500
3 4 False 0.1 2.830500e+11 0.015 1 H2201Diameter1 0.005 20180214142319 3 2018-02-14 14:28:41.423109 0.015
4 5 False 2.5 2.830500e+11 0.500 1 H2201Thickness2 0.010 20180214142319 3 2018-02-14 14:28:41.423109 0.500
I want to pivot this dataframe such that 'Name' is turned into columns, and the rows populated by 'Value'.
Currently I have tried the following:
dfPivot = df.pivot(index='Id',columns='Name',values='Value)
I thought this would create the results I need, and that has been the case for the other threads ive seen. But in my case the following happens
In [3]: dfPivot
Out [4]:
Name H1001Diameter1 H1001Diameter10 H1001Diameter12
Id
1 Nan Nan Nan
And the same continues to the end of the dataframe, all values being Nan. The original datatype is a float64, and there are no Nans in the original data.
Any pointers on how to solve this? Sorry if this is a noob question, or please let me know if you need me to edit my question/example.
try
pd.pivot_table(df[['Id', 'Name', 'Value']],
index='Id',
columns=['Name'],
values=['Value'],
aggfunc=lambda x: x)
This assumes that you don't duplicate values. Otherwise, you need to edit the aggfunc to do proper aggregation
I have dataset, in which I read the data, df.dir.value_counts() returns
169 23042
170 22934
168 22873
316 22872
315 22809
171 22731
317 22586
323 22561
318 22530
...
0.069 1
0.167 1
0557 1
0.093 1
1455 1
0.130 1
0.683 1
2211 1
3.714 1
1.093 1
0819 1
0.183 1
0.110 1
2241 1
0.34 1
0.330 1
0.563 1
60+9 1
0.910 1
0.232 1
1410 1
0.490 1
0.107 1
1.257 1
1704 1
0.491 1
1.180 1
5-230 1
1735 1
1.384 1
The dir column is about direction, and the data should be integer, ranging from (0,361). As you can see, there are a lot of errones data at the end of the value_counts() list.
I want to know, how can I drop the non-integer data?
There are some possible ways
1.read_csv as integer and throw all non-integer data
df = pd.read_csv("/data.dat", names = ['time', 'dir'], dtype={'dir': int}})
However, there some string like error data, such as 60+9, which would cause error. I don't know how to handle it.
2.Select by isdigit(), and then do a downcast
df = df[df['dir'].apply(lambda x: str(x).isdigit())]
df['dir']=pd.to_numeric(df['dir'], downcast='integer', errors='coerce')
This is from Drop rows if value in a specific column is not an integer in pandas dataframe, and works fine for me, but it feels a little bit too much. I'm wondering if there are better approaches?
I like
df.dir[df.dir == df.dir // 1]
How It Works
Consider the dataframe df
df = pd.DataFrame(dict(dir=[1, 1.5, 2, 2.5]))
print(df)
dir
0 1.0
1 1.5
2 2.0
3 2.5
Anything that is an integer should be equal to itself floor divided by one.
df.assign(floor_div=df.dir // 1)
dir floor_div
0 1.0 1.0
1 1.5 1.0
2 2.0 2.0
3 2.5 2.0
So we can test for when they are equal
df.assign(
floor_div=df.dir // 1,
is_int=df.dir // 1 == df.dir
)
dir floor_div is_int
0 1.0 1.0 True
1 1.5 1.0 False
2 2.0 2.0 True
3 2.5 2.0 False
So to filter, we can use the boolean mask in the demo column 'is_int'
df.dir[df.dir == df.dir // 1]
0 1.0
2 2.0
Name: dir, dtype: float64
If there are strings in this column, then you can incorporate pd.to_numeric
df.dir = pd.to_numeric(df.dir, 'coerce')
df.dir[df.dir == df.dir // 1]
I want to apply a function to row slices of dataframe in pandas for each row and returning a dataframe with for each row the value and number of slices that was calculated.
So, for example
df = pandas.DataFrame(numpy.round(numpy.random.normal(size=(2, 10)),2))
f = lambda x: (x - x.mean())
What I want is to apply lambda function f from column 0 to 5 and from column 5 to 10.
I did this:
a = pandas.DataFrame(f(df.T.iloc[0:5,:])
but this is only for the first slice.. how can include the second slice in the code, so that my resulting output frame looks exactly as the input frame -- just that every data point is changed to its value minus the mean of the corresponding slice.
I hope it makes sense.. What would be the right way to go with this?
thank you.
You can simply reassign the result to original df, like this:
import pandas as pd
import numpy as np
# I'd rather use a function than lambda here, preference I guess
def f(x):
return x - x.mean()
df = pd.DataFrame(np.round(np.random.normal(size=(2,10)), 2))
df.T
0 1
0 0.92 -0.35
1 0.32 -1.37
2 0.86 -0.64
3 -0.65 -2.22
4 -1.03 0.63
5 0.68 -1.60
6 -0.80 -1.10
7 -0.69 0.05
8 -0.46 -0.74
9 0.02 1.54
# makde a copy of df here
df1 = df
# just reassign the slices back to the copy
# edited, omit DataFrame part.
df1.T[:5], df1.T[5:] = f(df.T.iloc[0:5,:]), f(df.T.iloc[5:,:])
df1.T
0 1
0 0.836 0.44
1 0.236 -0.58
2 0.776 0.15
3 -0.734 -1.43
4 -1.114 1.42
5 0.930 -1.23
6 -0.550 -0.73
7 -0.440 0.42
8 -0.210 -0.37
9 0.270 1.91