I am trying to apply a map across several columns in pandas to reflect when data is invalid. When data is invalid in my df['Count'] column, I want to set my df['Value'], df['Lower Confidence Interval'], df['Upper Confidence Interval'] and df['Denominator'] columns to -1.
This is a sample of the dataframe:
Count Value Lower Confidence Interval Upper Confidence Interval Denominator
121743 54.15758428 53.95153779 54.36348867 224794
280 91.80327869 88.18009411 94.38654088 305
430 56.95364238 53.39535553 60.44152684 755
970 70.54545455 68.0815009 72.89492873 1375
nan
70 28.57142857 23.27957213 34.52488678 245
125 62.5 55.6143037 68.91456314 200
Currently, I am trying:
set_minus_1s = {np.nan: -1, '*': -1, -1: -1}
then:
df[['Value', 'Count', 'Lower Confidence Interval', 'Upper Confidence Interval', 'Denominator']] = df['Count'].map(set_minus_1s)
and getting the error:
ValueError: Must have equal len keys and value when setting with an iterable
Is there any way of chaining the column references to make one call to the map rather than have separate lines for each column to call the set_minus_1s dictionary as a map?
I think you can use where or mask to replace all rows where the mapped value is not null after applying map:
val = df['Count'].map(set_minus_1s)
print (val)
0 NaN
1 NaN
2 NaN
3 NaN
4 -1.0
5 NaN
6 NaN
Name: Count, dtype: float64
cols = ['Value', 'Count', 'Lower Confidence Interval', 'Upper Confidence Interval', 'Denominator']
df[cols] = df[cols].where(val.isnull(), val, axis=0)
print (df)
Count Value Lower Confidence Interval Upper Confidence Interval \
0 121743.0 54.157584 53.951538 54.363489
1 280.0 91.803279 88.180094 94.386541
2 430.0 56.953642 53.395356 60.441527
3 970.0 70.545455 68.081501 72.894929
4 -1.0 -1.000000 -1.000000 -1.000000
5 70.0 28.571429 23.279572 34.524887
6 125.0 62.500000 55.614304 68.914563
Denominator
0 224794.0
1 305.0
2 755.0
3 1375.0
4 -1.0
5 245.0
6 200.0
cols = ['Value', 'Count', 'Lower Confidence Interval', 'Upper Confidence Interval', 'Denominator']
df[cols] = df[cols].mask(val.notnull(), val, axis=0)
print (df)
Count Value Lower Confidence Interval Upper Confidence Interval \
0 121743.0 54.157584 53.951538 54.363489
1 280.0 91.803279 88.180094 94.386541
2 430.0 56.953642 53.395356 60.441527
3 970.0 70.545455 68.081501 72.894929
4 -1.0 -1.000000 -1.000000 -1.000000
5 70.0 28.571429 23.279572 34.524887
6 125.0 62.500000 55.614304 68.914563
Denominator
0 224794.0
1 305.0
2 755.0
3 1375.0
4 -1.0
5 245.0
6 200.0
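For completeness, a minimal sketch of the same fix with plain boolean indexing, reusing cols and set_minus_1s from above: build the invalid-row mask once, then write -1 into every target column with a single .loc assignment.
# rows whose Count maps to something in set_minus_1s (NaN, '*' or -1) are invalid
invalid = df['Count'].map(set_minus_1s).notnull()
df.loc[invalid, cols] = -1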
I have a csv file with bid/ask prices of many bonds (using ISIN identifiers) for the past year. Using these historical prices, I'm trying to calculate the historical volatility for each bond. Although it should typically be an easy task, the issue is that not all bonds have exactly the same number of days of trading price data, while they're all in the same column and not stacked. Hence, if I need to calculate a rolling std deviation, I can't choose a standard rolling window of 252 days for 1 yr.
The data set has this format:
BusinessDate  ISIN   Bid  Ask
Date 1        ISIN1  P1   P2
Date 2        ISIN1  P1   P2
...
Date 252      ISIN1  P1   P2
Date 1        ISIN2  P1   P2
Date 2        ISIN2  P1   P2
...
and so on.
My current code is as follows-
vol_df = pd.read_csv('hist_prices.csv')
vol_df['BusinessDate'] = pd.to_datetime(vol_df['BusinessDate'])
vol_df['Mid Price'] = vol_df[['Bid', 'Ask']].mean(axis = 1)
vol_df['log_return'] = vol_df.groupby('ISIN')['Mid Price'].apply(lambda x: np.log(x) - np.log(x.shift(1)))
vol_df['hist_vol'] = vol_df['log_return'].std() * np.sqrt(252)
The last line of code seems to be giving all NaN values in the column. This is most likely because the std deviation is being calculated on the same row rather than over a list of numbers. I tried replacing the last line to use a rolling std:
vol_df.set_index('BusinessDate').groupby('ISIN').rolling(window = 1, freq = 'A').std()['log_return']
But this doesn't help either; it gives 2 numbers for each ISIN. I also tried to use pivot() to place the ISINs in columns and BusinessDate as index, with the prices as values, but it gives an error. Also, I have close to 9,000 different ISINs, so putting them in columns to calculate std() for each column may not be the best way. Any clues on how I can sort this out?
I was able to resolve this in a crude way:
vol_df_2 = vol_df.groupby('ISIN')['logret'].std()
vol_df_3 = vol_df_2.to_frame()
vol_df_3.rename(columns = {'logret': 'daily_std'}, inplace = True)
The first line above returns a series with the std deviation named 'logret', so the 2nd and 3rd lines convert it into a dataframe and rename the daily std deviation accordingly. Finally, the annual vol can be calculated using sqrt(252).
If anyone has a better way to do it in the same dataframe instead of creating a series, that'd be great.
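If it helps, a minimal sketch of one way to keep everything in the same dataframe, assuming the log-return column is named 'logret' as above: groupby().transform('std') returns one value per original row, so nothing needs converting or renaming.
import numpy as np

# transform('std') broadcasts each ISIN's daily std back onto its own rows
vol_df['daily_std'] = vol_df.groupby('ISIN')['logret'].transform('std')
vol_df['hist_vol'] = vol_df['daily_std'] * np.sqrt(252)  # annualised vol per row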
OK, this almost works now.
It does need some math per ISIN to figure out the rolling period; I just used 3 and 2 in my example. You probably need to count how many days of trading there are in the year, or whatever, and fix the window at that per ISIN somehow.
And then you need to figure out how to merge the data back. The output actually has errors because it's updating a copy, but this is kind of what I was looking for here. I am sure someone who knows more could fix it at this point. I can't get the merge working.
toy_data={'BusinessDate': ['10/5/2020','10/6/2020','10/7/2020','10/8/2020','10/9/2020',
'10/12/2020','10/13/2020','10/14/2020','10/15/2020','10/16/2020',
'10/5/2020','10/6/2020','10/7/2020','10/8/2020'],
'ISIN': [1,1,1,1,1, 1,1,1,1,1, 2,2,2,2],
'Bid': [0.295,0.295,0.295,0.295,0.295,
0.296, 0.296,0.297,0.298,0.3,
2.5,2.6,2.71,2.8],
'Ask': [0.301,0.305,0.306,0.307,0.308,
0.315,0.326,0.337,0.348,0.37,
2.8,2.7,2.77,2.82]}
#vol_df = pd.read_csv('hist_prices.csv')
vol_df = pd.DataFrame(toy_data)
vol_df['BusinessDate'] = pd.to_datetime(vol_df['BusinessDate'])
vol_df['Mid Price'] = vol_df[['Bid', 'Ask']].mean(axis = 1)
vol_df['log_return'] = vol_df.groupby('ISIN')['Mid Price'].apply(lambda x: np.log(x) - np.log(x.shift(1)))
vol_df.dropna(subset = ['log_return'], inplace=True)
# do some math here to calculate how many days you want to roll for an ISIN
# maybe count how many days over a 1 year period exist???
# not really sure how you'd miss days unless stuff just doesnt trade
# (but I don't need to understand it anyway)
rolling = {1: 3, 2: 2}
for isin in vol_df['ISIN'].unique():
    roll = rolling[isin]
    print(f'isin={isin}, roll={roll}')
    df_single = vol_df[vol_df['ISIN']==isin]
    df_single['rolling'] = df_single['log_return'].rolling(roll).std()
    # i can't get the right syntax to merge data back, but this shows it
    vol_df[isin, 'rolling'] = df_single['rolling']
    print(df_single)
print(vol_df)
which outputs (minus the warnings):
isin=1, roll=3
BusinessDate ISIN Bid Ask Mid Price log_return rolling
1 2020-10-06 1 0.295 0.305 0.3000 0.006689 NaN
2 2020-10-07 1 0.295 0.306 0.3005 0.001665 NaN
3 2020-10-08 1 0.295 0.307 0.3010 0.001663 0.002901
4 2020-10-09 1 0.295 0.308 0.3015 0.001660 0.000003
5 2020-10-12 1 0.296 0.315 0.3055 0.013180 0.006650
6 2020-10-13 1 0.296 0.326 0.3110 0.017843 0.008330
7 2020-10-14 1 0.297 0.337 0.3170 0.019109 0.003123
8 2020-10-15 1 0.298 0.348 0.3230 0.018751 0.000652
9 2020-10-16 1 0.300 0.370 0.3350 0.036478 0.010133
isin=2, roll=2
BusinessDate ISIN Bid ... log_return (1, rolling) rolling
11 2020-10-06 2 2.60 ... 2.220446e-16 NaN NaN
12 2020-10-07 2 2.71 ... 3.339828e-02 NaN 0.023616
13 2020-10-08 2 2.80 ... 2.522656e-02 NaN 0.005778
[3 rows x 8 columns]
BusinessDate ISIN Bid ... log_return (1, rolling) (2, rolling)
1 2020-10-06 1 0.295 ... 6.688988e-03 NaN NaN
2 2020-10-07 1 0.295 ... 1.665279e-03 NaN NaN
3 2020-10-08 1 0.295 ... 1.662511e-03 0.002901 NaN
4 2020-10-09 1 0.295 ... 1.659751e-03 0.000003 NaN
5 2020-10-12 1 0.296 ... 1.317976e-02 0.006650 NaN
6 2020-10-13 1 0.296 ... 1.784313e-02 0.008330 NaN
7 2020-10-14 1 0.297 ... 1.910886e-02 0.003123 NaN
8 2020-10-15 1 0.298 ... 1.875055e-02 0.000652 NaN
9 2020-10-16 1 0.300 ... 3.647821e-02 0.010133 NaN
11 2020-10-06 2 2.600 ... 2.220446e-16 NaN NaN
12 2020-10-07 2 2.710 ... 3.339828e-02 NaN 0.023616
13 2020-10-08 2 2.800 ... 2.522656e-02 NaN 0.005778
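For reference, a minimal sketch of one way the merge-back might work without updating a copy, reusing the rolling dict from the code above: iterate the groups and assign through each group's original index.
import numpy as np

vol_df['rolling'] = np.nan
for isin, grp in vol_df.groupby('ISIN'):
    # grp keeps the original row labels, so .loc writes the values back aligned
    vol_df.loc[grp.index, 'rolling'] = grp['log_return'].rolling(rolling[isin]).std()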
I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
Following this answer to a previous question I had asked, I used this code to summarise the ultrasound measurements using the maximum measurement recorded in a single trimester (13 weeks):
(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
 .drop(columns = 'gestationalAgeInWeeks')
 .groupby(['MotherID', 'PregnancyID', 'tm'])
 .agg('max')
 .unstack()
)
This results in the following output:
tm 1 2 3
MotherID PregnancyID
0 0 NaN 200.0 NaN
1 1 NaN 294.0 350.0
2 2 180.0 NaN NaN
However, MotherID and PregnancyID no longer appear as columns in the output of df.info(). Similarly, when I output the dataframe to a csv file, I only get columns 1,2 and 3. The id columns only appear when running df.head() as can be seen in the dataframe above.
I need to preserve the id columns as I want to use them to merge this dataframe with another one using the ids. Therefore, my question is, how do I preserve these id columns as part of my dataframe after running the code above?
Chain that with reset_index (moving add_prefix after unstack, so it prefixes only the new columns rather than the id values in the index):
(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
 # .drop(columns = 'gestationalAgeInWeeks') # don't need this
 .groupby(['MotherID', 'PregnancyID', 'tm'])['abdomCirc'] # change here
 .max()
 .unstack()
 .add_prefix('abdomCirc_') # here
 .reset_index() # and here
)
Or a more friendly version with pivot_table:
(df.assign(tm = (df['gestationalAgeInWeeks']+ 13 - 1 )// 13)
.pivot_table(index= ['MotherID', 'PregnancyID'], columns='tm',
values= 'abdomCirc', aggfunc='max')
.add_prefix('abdomCirc_') # remove this if you don't want the prefix
.reset_index()
)
Output:
tm  MotherID  PregnancyID  abdomCirc_1  abdomCirc_2  abdomCirc_3
0          0            0          NaN        200.0          NaN
1          1            1          NaN        315.0        350.0
2          2            2        180.0          NaN          NaN
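With the ids restored as regular columns, the merge mentioned in the question should work as usual; a short sketch, where result is the output of either chain above and other_df is a hypothetical second dataframe keyed on the same ids:
combined = result.merge(other_df, on=['MotherID', 'PregnancyID'], how='left')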
I have a dataframe with two numeric columns. I want to add a third column that calculates the difference. But the condition is: if the values in the first column are blank or NaN, the difference should be the value in the second column.
Can anyone help me with this problem?
Any suggestions and clues will be appreciated!
Thank you.
You should use vectorised operations where possible. Here you can use numpy.where:
df['Difference'] = np.where(df['July Sales'].isnull(), df['August Sales'],
df['August Sales'] - df['July Sales'])
However, note this is precisely equivalent to treating NaN values in df['July Sales'] as zero. So you can use pd.Series.fillna:
df['Difference'] = df['August Sales'] - df['July Sales'].fillna(0)
This isn't really a conditional situation; it is just a math operation. Suppose you have a df with July Sales and August Sales columns; consider the .sub() method with fill_value:
df['Diff'] = df['August Sales'].sub(df['July Sales'], fill_value=0)
returns output:
July Sales August Sales Diff
0 459.0 477 18.0
1 422.0 125 -297.0
2 348.0 483 135.0
3 397.0 271 -126.0
4 NaN 563 563.0
5 191.0 325 134.0
6 435.0 463 28.0
7 NaN 479 479.0
8 475.0 473 -2.0
9 284.0 496 212.0
I used a sample dataframe, but it shouldn't be hard to adapt:
df = pd.DataFrame({'A': [1, 2, np.nan, 3], 'B': [10, 20, 30, 40]})
def diff(row):
    return row['B'] if pd.isnull(row['A']) else (row['B'] - row['A'])
df['C'] = df.apply(diff, axis=1)
ORIGINAL DATAFRAME:
A B
0 1.0 10
1 2.0 20
2 NaN 30
3 3.0 40
AFTER apply:
A B C
0 1.0 10 9.0
1 2.0 20 18.0
2 NaN 30 30.0
3 3.0 40 37.0
Try this (using pd.isnull for the check, since NaN is truthy and `not row['col1']` would miss it):
def diff(row):
    if pd.isnull(row['col1']):
        return row['col2']
    else:
        return row['col2'] - row['col1']
df['col3'] = df.apply(diff, axis=1)
I have created two series and I want to create a third series by doing element-wise multiplication of the first two. My code is given below:
new_samples = 10 # Number of samples in series
a = pd.Series([list(map(lambda x:x,np.linspace(2,2,new_samples)))],index=['Current'])
b = pd.Series([list(map(lambda x:x,np.linspace(10,0,new_samples)))],index=['Voltage'])
c = pd.Series([x*y for x,y in zip(a.tolist(),b.tolist())],index=['Power'])
My output is:
TypeError: can't multiply sequence by non-int of type 'list'
To keep things clear, I am pasting my actual for-loop code below. My dataframe already has three columns: Current, Voltage, Power. For my requirement, I have to add new lists of values to the existing Voltage and Current columns, while the Power values are created by multiplying the newly created values. My code is given below:
for i, j in zip(IV_start_index, IV_start_index[1:]):
    isc_act = module_allData_df['Current'].iloc[i:j-1].max()
    isc_indx = module_allData_df['Current'].iloc[i:j-1].idxmax()
    sample_count = int((j-i)/(module_allData_df['Voltage'].iloc[i]-module_allData_df['Voltage'].iloc[j-1]))
    new_samples = int(sample_count * (module_allData_df['Voltage'].iloc[isc_indx]))
    missing_current = pd.Series([list(map(lambda x:x,np.linspace(isc_act,isc_act,new_samples)))],index=['Current'])
    missing_voltage = pd.Series([list(map(lambda x:x,np.linspace(module_allData_df['Voltage'].iloc[isc_indx],0,new_samples)))],index=['Voltage'])
    print(missing_current.tolist()*missing_voltage.tolist())
Sample data: module_allData_df.head()
Voltage Current Power
0 33.009998 -0.004 -0.13204
1 33.009998 0.005 0.16505
2 32.970001 0.046 1.51662
3 32.950001 0.087 2.86665
4 32.919998 0.128 4.21376
sample data: module_allData_df.iloc[120:126] (you will need this as well)
Voltage Current Power
120 0.980000 5.449 5.34002
121 0.920000 5.449 5.01308
122 0.860000 5.449 4.68614
123 0.790000 5.449 4.30471
124 33.110001 -0.004 -0.13244
125 33.110001 0.005 0.16555
sample data: IV_start_index[:5]
[0, 124, 251, 381, 512]
Based on @jezrael's answer, I have successfully created three separate series. How do I append them to the main dataframe? My requirement is explained in the following plot.
The problem is that each Series is a single element containing a list, so vectorized operations are not possible.
a = pd.Series([list(map(lambda x:x,np.linspace(2,2,new_samples)))],index=['Current'])
b = pd.Series([list(map(lambda x:x,np.linspace(10,0,new_samples)))],index=['Voltage'])
print (a)
Current [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, ...
dtype: object
print (b)
Voltage [10.0, 8.88888888888889, 7.777777777777778, 6....
dtype: object
So I believe you need to remove the [] and, if necessary, add the name parameter:
a = pd.Series(list(map(lambda x:x,np.linspace(2,2,new_samples))), name='Current')
b = pd.Series(list(map(lambda x:x,np.linspace(10,0,new_samples))),name='Voltage')
print (a)
0 2.0
1 2.0
2 2.0
3 2.0
4 2.0
5 2.0
6 2.0
7 2.0
8 2.0
9 2.0
Name: Current, dtype: float64
print (b)
0 10.000000
1 8.888889
2 7.777778
3 6.666667
4 5.555556
5 4.444444
6 3.333333
7 2.222222
8 1.111111
9 0.000000
Name: Voltage, dtype: float64
c = a * b
print (c)
0 20.000000
1 17.777778
2 15.555556
3 13.333333
4 11.111111
5 8.888889
6 6.666667
7 4.444444
8 2.222222
9 0.000000
dtype: float64
EDIT:
If you want to output the multiplied Series, only the last 2 rows need to change:
missing_current = pd.Series(list(map(lambda x:x,np.linspace(isc_act,isc_act,new_samples))))
missing_voltage = pd.Series(list(map(lambda x:x,np.linspace(module_allData_df['Voltage'].iloc[isc_indx],0,new_samples))))
print(missing_current *missing_voltage)
It's easier using numpy.
import numpy as np
new_samples = 10 # Number of samples in series
a = np.linspace(2, 2, new_samples)
b = np.linspace(10, 0, new_samples)
c = a*b
print(c)
Output:
[20.         17.77777778 15.55555556 13.33333333 11.11111111
  8.88888889  6.66666667  4.44444444  2.22222222  0.        ]
As you are doing everything with a pandas dataframe, use the code below.
import numpy as np
import pandas as pd
new_samples = 10 # Number of samples in series
df = pd.DataFrame({'Current':np.linspace(2,2,new_samples),'Voltage':np.linspace(10,0,new_samples)})
df['Power'] = df['Current'] * df['Voltage']
print(df.to_string(index=False))
Output:
Current Voltage Power
2.0 10.000000 20.000000
2.0 8.888889 17.777778
2.0 7.777778 15.555556
2.0 6.666667 13.333333
2.0 5.555556 11.111111
2.0 4.444444 8.888889
2.0 3.333333 6.666667
2.0 2.222222 4.444444
2.0 1.111111 2.222222
2.0 0.000000 0.000000
Because they are Series, you should be able to just multiply them: c = a * b.
You could add a and b to a dataframe, and then c becomes the third column.
mean chart:
interval gross(mean)
(1920, 1925] NaN
(1925, 1930] 3.443000e+06
(1930, 1935] 4.746000e+05
(1935, 1940] 2.011249e+06
I have a huge dataframe (df) which has some NaN values in the gross column.
Now I want to fill those NaN values from the mean chart according to the respective interval.
df:
name gross interval
k 1000 (1935, 1940]
l NaN (1950, 1955]
...
Here interval is a categorical index.
You can add a column to the dataframe with the corresponding mean value using your mean chart (you can do a left join using pd.merge, joining on the interval column). Once you have this column, you can use:
df['gross'] = df['gross'].fillna(df['means'])
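A minimal sketch of that approach, assuming the mean chart is a dataframe named means_df with columns 'interval' and 'gross(mean)' (illustrative names), and that both interval columns share the same dtype:
import pandas as pd

# left join brings each row's interval mean in as a helper 'means' column
merged = pd.merge(df, means_df.rename(columns={'gross(mean)': 'means'}),
                  on='interval', how='left')
merged['gross'] = merged['gross'].fillna(merged['means'])
merged = merged.drop(columns='means')  # helper column no longer needed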
You can create a new Series with map and then replace NaNs using combine_first.
The main advantage is that no helper column is needed, so there is nothing to remove later.
df1=pd.DataFrame({'gross(mean)':[np.nan,3.443000e+06, 4.746000e+05, 2.011249e+06, 10,20,30],
'interval':[1922,1927,1932, 1938,1932,1938,1953]})
df1['interval'] = pd.cut(df1['interval'], bins=[1920,1925,1930,1935,1940,1945,1950,1955])
print (df1)
gross(mean) interval
0 NaN (1920, 1925]
1 3443000.0 (1925, 1930]
2 474600.0 (1930, 1935]
3 2011249.0 (1935, 1940]
4 10.0 (1930, 1935]
5 20.0 (1935, 1940]
6 30.0 (1950, 1955]
df = pd.DataFrame({'name':['k','l'],
'gross':[1000, np.nan],
'interval':[1938, 1952]}, columns=['name','gross','interval'])
df['interval'] = pd.cut(df['interval'], bins=[1925,1930,1935,1940,1945,1950,1955])
print (df)
name gross interval
0 k 1000.0 (1935, 1940]
1 l NaN (1950, 1955]
mapped = df['interval'].map(df1.set_index('interval')['gross(mean)'].to_dict())
print (mapped)
0 20.0
1 30.0
Name: interval, dtype: float64
df['gross'] = df['gross'].combine_first(mapped)
print (df)
name gross interval
0 k 1000.0 (1935, 1940]
1 l 30.0 (1950, 1955]