Combine daily data into monthly data in Excel using Python

I am trying to figure out how I can combine daily dates into specific months, summing the data for each day that falls within that month.
Note: I have a huge list with daily dates, but I put a small sample here to simplify the example.
File name: (test.xlsx)
For example, (sheet1) contains the following, read into a dataframe:
DATE 51 52 53 54 55 56
0 20110706 28.52 27.52 26.52 25.52 24.52 23.52
1 20110707 28.97 27.97 26.97 25.97 24.97 23.97
2 20110708 28.52 27.52 26.52 25.52 24.52 23.52
3 20110709 28.97 27.97 26.97 25.97 24.97 23.97
4 20110710 30.5 29.5 28.5 27.5 26.5 25.5
5 20110711 32.93 31.93 30.93 29.93 28.93 27.93
6 20110712 35.54 34.54 33.54 32.54 31.54 30.54
7 20110713 33.02 32.02 31.02 30.02 29.02 28.02
8 20110730 35.99 34.99 33.99 32.99 31.99 30.99
9 20110731 30.5 29.5 28.5 27.5 26.5 25.5
10 20110801 32.48 31.48 30.48 29.48 28.48 27.48
11 20110802 31.04 30.04 29.04 28.04 27.04 26.04
12 20110803 32.03 31.03 30.03 29.03 28.03 27.03
13 20110804 34.01 33.01 32.01 31.01 30.01 29.01
14 20110805 27.44 26.44 25.44 24.44 23.44 22.44
15 20110806 32.48 31.48 30.48 29.48 28.48 27.48
What I would like is to edit ("test.xlsx", 'sheet1') so that it results in what is below:
DATE 51 52 53 54 55 56
0 201107 313.46 303.46 293.46 283.46 273.46 263.46
1 201108 189.48 183.48 177.48 171.48 165.48 159.48
How would I go about implementing this?
Here is my code thus far:
import pandas as pd
from pandas import ExcelWriter

df = pd.read_excel('thecddhddtestquecdd.xlsx')

def sep_yearmonths(x):
    x['month'] = str(x['DATE'])[:-2]
    return x

df = df.apply(sep_yearmonths, axis=1)
df.groupby('month').sum()

writer = ExcelWriter('thecddhddtestquecddMERGE.xlsx')
df.to_excel(writer, 'Sheet1', index=False)
writer.save()

This will work if 'DATE' is a column of strings and not your index.
Example dataframe - shortened for clarity:
df = pd.DataFrame({'DATE': {0: '20110706', 1: '20110707', 2: '20110801'},
                   52: {0: 28.52, 1: 28.97, 2: 28.52},
                   55: {0: 24.52, 1: 24.97, 2: 24.52}})
Which yields:
52 55 DATE
0 28.52 24.52 20110706
1 28.97 24.97 20110707
2 28.52 24.52 20110801
Apply the following function over the dataframe to generate a new column:
def sep_yearmonths(x):
    x['month'] = x['DATE'][:-2]
    return x
Like this:
df = df.apply(sep_yearmonths,axis=1)
Over which you can then groupby and sum:
df.groupby('month').sum()
Resulting in the following:
52 55
month
201107 57.49 49.49
201108 28.52 24.52
If 'DATE' is your index, simply call reset_index beforehand. If it's not a column of string values, convert it to strings beforehand.
Finally, you can rename your 'month' column to 'DATE'. I suppose you could just substitute the 'DATE' column in place, but I chose to do things explicitly. You can do that like so:
df['DATE'] = df['DATE'].apply(lambda x: x[:-2])
Then groupby 'DATE' instead of 'month'.
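Putting the pieces together, a minimal end-to-end sketch (assuming the test.xlsx/sheet1 names from the question, reading DATE in as strings; the output file name is only a placeholder):
import pandas as pd

df = pd.read_excel('test.xlsx', sheet_name='sheet1', dtype={'DATE': str})
# drop the day part, e.g. '20110706' -> '201107'
df['DATE'] = df['DATE'].str[:-2]
monthly = df.groupby('DATE', as_index=False).sum()
monthly.to_excel('test_monthly.xlsx', sheet_name='Sheet1', index=False)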

Use resample
import pandas as pd
myTable=pd.read_excel('test.xlsx')
myTable['DATE']=pd.to_datetime(myTable['DATE'], format="%Y%m%d")
myTable=myTable.set_index('DATE')
myTable.resample("M").sum()
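If you also want the YYYYMM formatting from the question and an Excel file back out, a possible follow-up (a sketch; the output file name is only a placeholder):
monthly = myTable.resample("M").sum()
# format the month-end dates back to strings like '201107'
monthly.index = monthly.index.strftime("%Y%m")
monthly.to_excel('test_monthly.xlsx')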

Related

Mean of values in some columns with Pandas/Numpy

I started with Pandas and Numpy a couple of months ago and have already learned quite a lot thanks to all the threads here. But now I can't find what I need.
For work, I have created an Excel sheet that calculates some figures to be used for re-ordering inventory. To practice, and maybe actually use it, I want to try to replicate the functionality in Python. Later I might want to add some more sophisticated calculations with the help of scikit-learn.
So far I've managed to load a csv with sales figures from our ERP into a dataframe and calculate the mean and std. The calculations have been done on a subset of the data because I don't know how to apply calculations only to specific columns. The csv also contains, for example, product codes and lead times, and these should not be used for the average and std calculations. I'm also not sure yet how to merge this subset back with the original dataframe.
The reason why I didn't hardcode the column names is that the ERP reports the sales numbers over the past x months, so the order of the columns will change throughout the year and I want to keep them in chronological order.
My data from the csv looks like:
"code","leadtime","jan","feb","mar","apr","may","jun","jul","aug","sep","oct","nov","dec"
"001.002",60,299,821,351,614,246,957,968,939,125,368,727,231
"001.002",25,340,274,733,575,904,953,614,268,638,960,617,757
"001.002",130,394,327,435,767,377,699,424,951,972,717,317,264
What I've done so far is working fine (this can probably be done much more easily/efficiently):
import numpy as np
import timeit
import csv
import pandas as pd
sd = 1
csv_in = "data_in.csv"
csv_out = "data_out.csv"
# Use Pandas
df = pd.read_csv(csv_in, dtype={'code': str})
# Get no. of columns and subtract 2 for code and leadtime
cols = df.shape[1] - 2
# Create a subset and count the columns
df_subset = df.iloc[:, -cols:]
subset_cols = df_subset.shape[1]
# Add columns for std dev and average
df_subset = df_subset.assign(mean=df_subset.mean(axis=1),
                             stddev=df_subset.std(axis=1, ddof=0))
# Add columns for min and max values based on mean +/- std multiplied by factor sd
df_subset = df_subset.assign(minSD=df_subset['mean'].sub(df_subset['stddev'] * sd),
                             maxSD=df_subset['mean'].add(df_subset['stddev'] * sd))
df_subset
Which gives me:
jan feb mar apr may jun jul aug sep oct nov dec mean stddev minSD maxSD
0 299 821 351 614 246 957 968 939 125 368 727 231 553.833333 304.262998 249.570335 858.096332
1 340 274 733 575 904 953 614 268 638 960 617 757 636.083333 234.519530 401.563804 870.602863
2 394 327 435 767 377 699 424 951 972 717 317 264 553.666667 242.398203 311.268464 796.064870
However for my next calculation I'm stuck again:
I want to calculate the average over the values from the "month" columns, but only over the values that satisfy >= minSD and <= maxSD.
So for row 0, I'm looking for the value (299+821+351+614+368+727)/6 = 530
How can I achieve this?
I've tried this, but this doesn't seem to work:
df_subset = df_subset.assign(avgwithSD=df_subset.iloc[:,0:subset_cols].values(where(df_subset.values>=df_subset['minSD'] & df_subset.values>=df_subset['maxSD'])).mean(axis=1))
Some help would be very welcome. Thanks
EDIT: With help I ended up using this to get further with my program
import numpy as np
import timeit
import csv
import pandas as pd
# sd will determine if range will be SD1 or SD2
sd = 1
# file to use
csv_in = "data_in.csv"
csv_out = "data_out.csv"
# Function to calculate the mean of the values within the range between minSD and maxSD
def CalcMeanSD(row):
    months_ = row[2:14]
    min_SD = row[-2]
    max_SD = row[-1]
    return months_[(months_ >= min_SD) & (months_ <= max_SD)]
# Use Pandas
df = pd.read_csv(csv_in,dtype={'code': str})
# Define the month/data columns and cast them to float values
months_cols = df.columns[2:]
df.loc[:, months_cols] = df.loc[:, months_cols].astype('float64')
# Add columns for stddev and mean. Based on these values set new range between minSD and maxSD
df['stddev'] = df.loc[:,months_cols].std(axis=1, ddof=0)
df['mean'] = df.loc[:, months_cols].mean(axis=1)
df['minSD'] = df['mean'].sub(df['stddev'] * sd)
df['maxSD'] = df['mean'].add(df['stddev'] * sd)
# Add column with the mean of the new range
df['avgwithSD'] = np.nanmean(df.apply(CalcMeanSD, axis=1), axis=1)
df
Result is:
code leadtime jan feb mar apr may jun jul aug sep oct nov dec stddev mean minSD maxSD avgwithSD
0 001.002 60 299.0 821.0 351.0 614.0 246.0 957.0 968.0 939.0 125.0 368.0 727.0 231.0 304.262998 553.833333 249.570335 858.096332 530.000000
1 001.002 25 340.0 274.0 733.0 575.0 904.0 953.0 614.0 268.0 638.0 960.0 617.0 757.0 234.519530 636.083333 401.563804 870.602863 655.666667
2 001.002 130 394.0 327.0 435.0 767.0 377.0 699.0 424.0 951.0 972.0 717.0 317.0 264.0 242.398203 553.666667 311.268464 796.064870 495.222222
3 001.002 90 951.0 251.0 411.0 469.0 359.0 220.0 192.0 250.0 818.0 768.0 937.0 128.0 292.572925 479.500000 186.927075 772.072925 365.000000
4 001.002 35 228.0 400.0 46.0 593.0 61.0 293.0 5.0 203.0 850.0 506.0 37.0 631.0 264.178746 321.083333 56.904588 585.262079 281.833333
5 001.002 10 708.0 804.0 208.0 380.0 531.0 125.0 500.0 773.0 354.0 238.0 805.0 215.0 242.371773 470.083333 227.711560 712.455106 451.833333
6 001.002 14 476.0 628.0 168.0 946.0 29.0 324.0 3.0 400.0 981.0 467.0 459.0 571.0 295.814225 454.333333 158.519109 750.147558 436.625000
7 001.002 14 92.0 906.0 18.0 537.0 57.0 399.0 544.0 977.0 909.0 687.0 881.0 459.0 333.154577 538.833333 205.678756 871.987910 525.200000
8 001.002 90 487.0 634.0 5.0 918.0 158.0 447.0 713.0 459.0 465.0 643.0 482.0 672.0 233.756447 506.916667 273.160220 740.673113 555.777778
9 001.002 130 741.0 43.0 976.0 461.0 35.0 321.0 434.0 8.0 330.0 32.0 896.0 531.0 326.216782 400.666667 74.449885 726.883449 415.400000
EDIT:
Instead of your original code:
# first part:
months_cols = df.columns[2:]
df.loc[:, months_cols] = df.loc[:, months_cols].astype('float64')
df['stddev'] = df.loc[:,months_cols].std(axis=1, ddof=0)
df['mean'] = df.loc[:, months_cols].mean(axis=1)
df['minSD'] = df['mean'].sub(df['stddev'] * sd)
df['maxSD'] = df['mean'].add(df['stddev'] * sd)
# second part: (the one that doesn't work for you)
def calc_mean_per_row_by_condition(row):
    months_ = row[2:14]
    min_SD = row[-2]
    max_SD = row[-1]
    return months_[(months_ >= min_SD) & (months_ <= max_SD)]
df['avgwithSD'] = np.nanmean(df.apply(calc_mean_per_row_by_condition, axis=1), axis=1)
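If you prefer to avoid the row-wise apply, a vectorized alternative is possible (a sketch reusing df, months_cols, minSD and maxSD from the code above):
# mask the month values that fall outside [minSD, maxSD], then average what is left
month_vals = df[months_cols]
within = month_vals.ge(df['minSD'], axis=0) & month_vals.le(df['maxSD'], axis=0)
df['avgwithSD'] = month_vals.where(within).mean(axis=1)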

Accessing columns without name and deleting certain data from dataframe

I have a dataframe which has 14 columns and no headers. The first column is a date, and I need to delete all rows that are either older than 25 months from now or that do not contain a date.
44.93442 -79.37061 Tow 36 45.06541 -79.43384 R103 2053 WL7 CCG
44.96092 -78.39428 Flatbed Tow 56 0 0 P201 3040 SHOP CCG
45.05056 -78.50632 Jump Start/Battery Test 30 45.12152 -78.61311 P201 1426 FB1 CCG
45.15709 -77.91703 Winch/Extrication 20 0 0 K143 957 SHOP CCG
45.24038 -81.64227 Tow 36 45.25674 -81.65996 W444 1597 WL1 CCG
45.32589 -78.98001 Winch/Extrication 79 0 0 R105 43 SHOP CCG
45.33402 -79.20586 Tow 38 45.32871 -79.21218 R103 226 WL10 CCG
46.2062 -82.47623 Tow 68 46.18153 -82.96333 R812 588 FM1 CCG
46.50164 -84.28004 Winch/Extrication 36 46.53398 -84.35199 R829 10 WL1 CCG
46.50694 -84.32854 No service 27 46.53578 -84.38477 R829 124 WL1 CCG
46.51246 -84.33669 No service 728 46.50897 -84.3295 R829 123 FB1 CCG
46.52354 -84.38171 Flatbed Tow 66 46.51035 -84.25437 R829 49 FB2 CCG
2017-11-11 00:03:10 43.68109 -79.53289 Tow 33 43.54982 -79.71832 C1 2 WL2 CCG 00:05:50
2017-11-11 00:04:09 43.91352 -78.9673 Tow 24 43.93242 -78.86989 C207 4 M255 CCG 00:04:11
2017-11-11 00:05:09 42.93152 -81.19933 Tow 161 42.91458 -81.2216 G72 5 FB6 CCG P5 00:16:21
2017-11-11 00:05:14 44.26861 -76.57645 Jump Start/Battery Test 21 44.26594 -76.49862 K112 6 LS1 CCG 00:12:23
2017-11-11 00:07:09 43.30024 -79.90106 Flatbed Tow 53 43.30506 -79.90795 H935 8 350 CCG P5 00:14:35
2017-11-11 00:07:14 43.71246 -79.55377 Tow 40 43.7081 -79.5536 C100 9 WL25 CCG P2 00:07:22
I have tried this so far but have not had any success yet.
import pandas as pd
from datetime import datetime,timedelta
# find the month 25 months prior to now
N = 761
today = datetime.now()
lastMonth = today-timedelta(days=N)
month24= lastMonth.strftime("%Y-%m")
print (month24)
data = pd.read_csv("E:/ERS_DATA_TEMP.txt",delimiter='\t', dtype=str)
#access the dataframe and column
x=data.iloc[:,0]
print (x)
#remove data
data.drop(data[data.iloc[:,0]],[np.NaN]>=month24)
y=len(data.index)
Read in data without headers
df = pd.read_csv("E:/ERS_DATA_TEMP.txt",delimiter='\t', dtype=str, header=None)
Drop all rows without time
df = df[df[0].apply(pd.to_datetime, errors='coerce').notnull()]
Finally slice on dates
df = df[df[0] > month24]
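Putting those three steps together (a sketch, assuming the tab-separated file from the question and using the 761-day cutoff computed there):
import pandas as pd

df = pd.read_csv("E:/ERS_DATA_TEMP.txt", delimiter='\t', dtype=str, header=None)
dates = pd.to_datetime(df[0], errors='coerce')        # NaT for rows without a date
cutoff = pd.Timestamp.now() - pd.Timedelta(days=761)  # roughly 25 months back
df = df[dates.notna() & (dates >= cutoff)]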
You can use the iloc accessor to operate on the column:
# convert months to days
n_days = 25 * 30
# parse the first column; rows without a date become NaT
dates = pd.to_datetime(df.iloc[:, 0], errors='coerce')
# both will return a boolean series
t1 = dates.ge(pd.to_datetime('today') - pd.Timedelta(days=n_days))
t2 = dates.notna()
# keep only rows that have a date no older than n_days
df1 = df.loc[t1 & t2]
Sample Data
period = pd.date_range(start='20130101', freq='M', periods=1000)
df = pd.DataFrame({'period': period})

Parsing and create a new df with conditions

I need some help with python and pandas.
I have a dataframe in which the column seq1_id holds all the sequence IDs of species 1 and the second column holds the sequence IDs of species 2.
I passed a filter over those sequences and got two more dataframes (one with all sequences of sp1 that passed the filter, and one with all sequences of sp2 that passed the filter).
So I have 3 dataframes in total.
Because within a pair one sequence can pass the filter while the other does not, it is important to keep only the gene pairs where both members survived the two previous filtering steps. So what I need to do is parse my first df, which looks like this:
Seq_1.id Seq_2.id
seq1_A seq8_B
seq2_A Seq9_B
seq3_A Seq10_B
seq4_A Seq11_B
and check row by row whether (e.g. for the first row) seq1_A is present in df2 and seq8_B is present in df3; if so, keep this row from df1 and add it to a new df4.
Here is an example with output wanted:
first df:
Seq_1.id Seq_2.id
seq1_A seq8_B
seq2_A Seq9_B
seq3_A Seq10_B
seq4_A Seq11_B
df2 (sp1) (seq3_A is absent)
Seq_1.id
seq1_A
seq2_A
seq4_A
df3 (sp2) (Seq11_B is absent)
Seq_2.id
seq8_B
Seq9_B
Seq10_B
Then because Seq11_B and seq3_A are not present, the df4 (output) would be:
Seq_1.id Seq_2.id
seq1_A seq8_B
seq2_A Seq9_B
Here is what I tried:
candidates_0035=pd.read_csv("candidates_genes_filtering_0035",sep='\t')
candidates_0042=pd.read_csv("candidates_genes_filtering_0042",sep='\t')
dN_dS=pd.read_csv("dn_ds.out_sorted",sep='\t')
df4 =dN_dS[dN_dS['seq1_id'].isin(candidates_0042['gene'])&dN_dS['seq2_id'].isin(candidates_0035['gene'])]
and I got an empty output with only the column names, but it should not be like that.
Here are the data if you want to test the code on it:
df1:
Unnamed: 0 seq1_id seq2_id dN dS Dist_third_pos Dist_brute Length_seq_1 Length_seq_2 GC_content_seq1 GC_content_seq2 GC Mean_length
0 0 g66097.t1_0035_0035 g13600.t1_0042_0042 0.10455938989199982 0.3122332927029104 0.23600000000000002 0.142 535.0 1024.0 49.1588785046729 51.171875 50.165376752336456 535.0
1 1 g45594.t1_0035_0035 g1464.t1_0042_0042 0.5208761055250978 5.430485421797574 0.7120000000000001 0.489 246.0 222.0 47.967479674796756 44.594594594594604 46.28103713469567 222.0
2 2 g50055.t1_0035_0035 g34744.t1_0042_0035 0.08040473491714645 0.4233916132491867 0.262 0.139 895.0 749.0 56.312849162011176 57.67690253671562 56.994875849363396 749.0
3 3 g34020.t1_0035_0035 g12096.t1_0042_0042 0.4385191689737516 26.834927363887587 0.5760000000000001 0.433 597.0 633.0 37.85594639865997 39.810426540284354 38.83318646947217 597.0
4 4 g28436.t1_0035_0042 g35222.t1_0042_0035 0.055299811368483165 0.1181241496387666 0.1 0.069 450.0 461.0 45.111111111111114 44.90238611713666 45.006748614123886 450.0
5 5 g1005.t1_0035_0035 g11524.t1_0042_0042 0.3528036631463747 19.32549458735676 0.71 0.512 3177.0 3804.0 39.06200818382121 52.944269190325976 46.0031386870736 3177.0
6 6 g28456.t1_0035_0035 g31669.t1_0042_0035 0.4608959702286786 26.823981621115166 0.6859999999999999 0.469 516.0 591.0 49.224806201550386 53.46869712351946 51.346751662534935 516.0
7 7 g6202.t1_0035_0035 g193.t1_0042_0042 0.4679458383555545 17.81312422445775 0.66 0.462 804.0 837.0 41.91542288557214 47.67025089605735 44.79283689081474 804.0
8 8 g60667.t1_0035_0035 g14327.t1_0042_0042 0.046056273155280165 0.13320612138898 0.122 0.067 348.0 408.0 56.89655172413793 55.392156862745104 56.1443542934415 348.0
9 9 g30148.t1_0035_0042 g37790.t1_0042_0035 0.05631607180881047 0.19747150378706246 0.12300000000000001 0.08800000000000001 405.0 320.0 59.012345679012356 58.4375 58.72492283950618 320.0
10 10 g24481.t1_0035_0035 g37405.t1_0042_0035 0.2151957757290965 0.15106487998618026 0.135 0.17600000000000002 270.0 276.0 51.111111111111114 51.44927536231884 51.28019323671497 270.0
11 11 g33270.t1_0035_0035 g21201.t1_0042_0035 0.2773062983971916 21.13839474189674 0.6940000000000001 0.401 297.0 357.0 54.882154882154886 50.42016806722689 52.65116147469089 297.0
12 12 EOG090X03YJ_0035_0035_1 EOG090X03YJ_0042_0042_1 0.5402471721616758 19.278839157918302 0.7070000000000001 0.488 1321.0 1719.0 38.53141559424678 43.92088423502036 41.22614991463357 1321.0
13 13 g13075.t1_0035_0042 g504.t1_0042_0035 0.3317504066721263 4.790120127840871 0.65 0.38799999999999996 372.0 408.0 59.40860215053763 51.470588235294116 55.43959519291587 372.0
14 14 g1026.t1_0035_0035 g7716.t1_0042_0042 0.21445770772761286 13.92799368027682 0.626 0.344 336.0 315.0 38.095238095238095 44.444444444444436 41.26984126984127 315.0
15 15 g18238.t1_0035_0042 g35401.t1_0042_0035 0.3889830456691637 20.33679494952895 0.6759999999999999 0.44799999999999995 320.0 366.0 50.9375 49.453551912568315 50.19552595628416 320.0
df2:
Unnamed: 0 gene scaf_name start end cov_depth GC
179806 g13600.t1_0042_0042 scaffold_6556 1 1149 2.42361684558216 0.528846153846154
315037 g34744.t1_0042_0035 scaffold_8076 17 765 3.49803921568627 0.386138613861386
317296 g35222.t1_0042_0035 scaffold_9018 1 614 93.071661237785 0.41
183513 g14327.t1_0042_0042 scaffold_9358 122 529 3.3184165232357996 0.36
328164 g37790.t1_0042_0035 scaffold_16356 1 320 2.73125 0.436241610738255
326617 g37405.t1_0042_0035 scaffold_14890 1 341 1.3061224489795902 0.36898395721925104
188515 g15510.t1_0042_0042 scaffold_20183 1 276 137.326086956522 0.669354838709677
184561 g14562.t1_0042_0042 scaffold_10427 1 494 157.993927125506 0.46145940390544704
290684 g30982.t1_0042_0035 scaffold_3800 440 940 174.499839537869 0.39823008849557506
179993 g13632.t1_0042_0042 scaffold_6654 29 1114 3.56506849315068 0.46153846153846206
181670 g13942.t1_0042_0042 scaffold_7830 1 811 5.307028360049321 0.529411764705882
196148 g20290.t1_0042_0035 scaffold_1145 2707 9712 78.84112231766741 0.367283950617284
313624 g34464.t1_0042_0035 scaffold_7610 1 480 7.740440324449589 0.549019607843137
303133 g32700.t1_0042_0035 scaffold_5119 1735 2373 118.436578171091 0.49074074074074103
df3:
Unnamed: 0 gene scaf_name start end cov_depth GC
428708 g66097.t1_0035_0035 scaffold_306390 1 695 32.2431654676259 0.389880952380952
342025 g50055.t1_0035_0035 scaffold_188566 15 954 7.062893081761009 0.351129363449692
214193 g28436.t1_0035_0042 scaffold_231066 1 842 25.9774346793349 0.348837209302326
400337 g60667.t1_0035_0035 scaffold_261197 309 656 15.873529411764698 0.353846153846154
224023 g30148.t1_0035_0042 scaffold_263686 10 414 23.2072538860104 0.34108527131782895
184987 g24481.t1_0035_0035 scaffold_65047 817 1593 27.7840552416824 0.533898305084746
249413 g34492.t1_0035_0035 scaffold_106432 1 511 3.2482544608223396 0.368318122555411
249418 g34493.t1_0035_0035 scaffold_106432 547 1230 3.2482544608223396 0.368318122555411
12667 g1120.t1_0035_0042 scaffold_2095 2294 2794 47.864745898359295 0.56203288490284
252797 g35042.t1_0035_0035 scaffold_108853 274 1276 20.269592476489 0.32735426008968604
255878 g36112.t1_0035_0042 scaffold_437464 1 540 74.8252551020408 0.27884615384615397
40058 g4082.t1_0035_0042 scaffold_11195 579 1535 33.4396168320219 0.48487467588591204
271053 g39343.t1_0035_0042 scaffold_590976 1 290 19.6666666666667 0.38636363636363596
89911 g10947.t1_0035_0035 scaffold_21433 1735 2373 32.4222503160556 0.408571428571429
This should do it:
df4 = df1[df1['Seq_1.id'].isin(df2['Seq_1.id'])&df1['Seq_2.id'].isin(df3['Seq_2.id'])]
df4
# Seq_1.id Seq_2.id
#0 seq1_A seq8_B
#1 seq2_A Seq9_B
EDIT
You must have swapped the two candidate frames; this doesn't return empty:
df4 = dN_dS[(dN_dS['seq1_id'].isin(candidates_0035['gene']))&(dN_dS['seq2_id'].isin(candidates_0042['gene']))]
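A quick way to see which of the two isin filters is emptying the result is to count the matches for both orientations (a sketch using the frames from the question):
print(dN_dS['seq1_id'].isin(candidates_0035['gene']).sum(),
      dN_dS['seq1_id'].isin(candidates_0042['gene']).sum())
print(dN_dS['seq2_id'].isin(candidates_0035['gene']).sum(),
      dN_dS['seq2_id'].isin(candidates_0042['gene']).sum())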

Assigning (or tying in) function results back to original data in pandas

I am struggling with extracting the regression coefficients once I complete the function call np.polyfit (actual code below). I am able to get a display of each coefficient but am unsure how to actually extract them for future use with the original data.
df=pd.read_csv('2_skews.csv')
Here is a head() of the data
date expiry symbol strike vol
0 6/10/2015 1/19/2016 IBM 50 42.0
1 6/10/2015 1/19/2016 IBM 55 41.5
2 6/10/2015 1/19/2016 IBM 60 40.0
3 6/10/2015 1/19/2016 IBM 65 38.0
4 6/10/2015 1/19/2016 IBM 70 36.0
There are many symbols with many strikes across many days and many expiry dates as well
I have grouped the data by date, symbol and expiry and then call the regression function with this:
df_reg=df.groupby(['date','symbol','expiry']).apply(regress)
I have this function that seems to work well (it gives the proper coefficients); I just don't seem to be able to access them and tie them to the original data.
def regress(df):
    y = df['vol']
    x = df['strike']
    z = P.polyfit(x, y, 4)
    return z
I am importing polyfit like this:
from numpy.polynomial import polynomial as P
The final results:
df_reg
date symbol expiry
5/19/2015 GS 1/19/2016 [-112.064833151, 6.76871521993, -0.11147562136...
3/21/2016 [-131.2914493, 7.16441276062, -0.1145534833, 0...
IBM 1/19/2016 [211.458028147, -5.01236287512, 0.044819313514...
3/21/2016 [-34.1027973807, 3.16990194634, -0.05676206572...
6/10/2015 GS 1/19/2016 [50.3916788503, 0.795484227762, -0.02701849495...
3/21/2016 [31.6090441114, 0.851878910113, -0.01972772270...
IBM 1/19/2016 [-13.6159660078, 3.23002791603, -0.06015739505...
3/21/2016 [-51.6709051223, 4.80288173687, -0.08600312989...
dtype: object
The top result has the functional form of:
y = -0.000002x^4 + 0.000735x^3 - 0.111476x^2 + 6.768715x - 112.064833
I have tried to take the constructive criticism of previous individuals and make my question as clear as possible; please let me know if I still need to work on this :-)
John
Changing the output of regress to a Series rather than a numpy array will give you a data frame when you groupby. The index of the series will be the column names:
In [37]:
df = pd.DataFrame(
    [['6/10/2015', '1/19/2016', 'IBM', 50, 42.0],
     ['6/10/2015', '1/19/2016', 'IBM', 55, 41.5],
     ['6/10/2015', '1/19/2016', 'IBM', 60, 40.0],
     ['6/10/2015', '1/19/2016', 'IBM', 65, 38.0],
     ['6/10/2015', '1/19/2016', 'IBM', 70, 36.0]],
    columns=['date', 'expiry', 'symbol', 'strike', 'vol'])

def regress(df):
    y = df['vol']
    x = df['strike']
    z = np.polyfit(x, y, 4)
    return pd.Series(z, name='order', index=range(5)[::-1])
group_cols = ['date', 'expiry', 'symbol']
coeffs = df.groupby(group_cols).apply(regress)
coeffs
Out[40]:
order 4 3 2 1 0
date expiry symbol
6/10/2015 1/19/2016 IBM -5.388312e-18 0.000667 -0.13 8.033333 -118
To get the columns containing the coefficients for each combination of date, expiry and symbol you can then merge df and coeffs on these columns:
In [25]: df.merge(coeffs.reset_index(), on=group_cols)
Out[25]:
date expiry symbol strike vol 4 3 2 1 0
0 6/10/2015 1/19/2016 IBM 50 42.0 -6.644454e-18 0.000667 -0.13 8.033333 -118
1 6/10/2015 1/19/2016 IBM 55 41.5 -6.644454e-18 0.000667 -0.13 8.033333 -118
2 6/10/2015 1/19/2016 IBM 60 40.0 -6.644454e-18 0.000667 -0.13 8.033333 -118
3 6/10/2015 1/19/2016 IBM 65 38.0 -6.644454e-18 0.000667 -0.13 8.033333 -118
4 6/10/2015 1/19/2016 IBM 70 36.0 -6.644454e-18 0.000667 -0.13 8.033333 -118
You can then do something like
df = df.merge(coeffs.reset_index(), on=group_cols)
strike_powers = pd.DataFrame(dict((i, df.strike**i) for i in range(5)))
df['modelled_vol'] = (strike_powers * df[range(5)]).sum(axis=1)
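The multiplication lines up on the column labels 0-4, so each strike power meets the coefficient of the matching order. Equivalently, numpy's polyval can evaluate the fitted polynomial row by row (a sketch reusing the merged frame; np.polyval expects the highest-order coefficient first, which is the order np.polyfit returns):
import numpy as np
coef_cols = [4, 3, 2, 1, 0]  # highest order first
df['modelled_vol'] = [np.polyval(row[coef_cols].values, row['strike'])
                      for _, row in df.iterrows()]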

Python Pandas interpolate with new x-axis

I want to do interpolation for a Pandas series of the following structure
X
22.88 3.047
45.75 3.215
68.63 3.328
91.50 3.423
114.38 3.516
137.25 3.578
163.40 3.676
196.08 3.756
228.76 3.861
261.44 3.942
294.12 4.012
326.80 4.084
359.48 4.147
392.16 4.197
Name: Y, dtype: float64
I want to interpolate the data so that I have a new series covering X=[23:392:1]. I looked up the documentation but didn't find where I could specify the new x-axis. Did I miss something? How can I do interpolation with the new x-axis?
This can be done with pandas's reindex and interpolate:
In [27]: s
Out[27]:
1
0
22.88 3.047
45.75 3.215
68.63 3.328
91.50 3.423
114.38 3.516
137.25 3.578
163.40 3.676
196.08 3.756
228.76 3.861
261.44 3.942
294.12 4.012
326.80 4.084
359.48 4.147
392.16 4.197
[14 rows x 1 columns]
In [28]: idx = pd.Index(np.arange(23, 392))
In [29]: s.reindex(s.index + idx).interpolate(method='values')
Out[29]:
1
22.88 3.047000
23.00 3.047882
24.00 3.055227
25.00 3.062573
26.00 3.069919
27.00 3.077265
28.00 3.084611
29.00 3.091957
30.00 3.099303
31.00 3.106648
32.00 3.113994
33.00 3.121340
34.00 3.128686
35.00 3.136032
36.00 3.143378
37.00 3.150724
38.00 3.158070
39.00 3.165415
40.00 3.172761
41.00 3.180107
42.00 3.187453
43.00 3.194799
44.00 3.202145
45.00 3.209491
45.75 3.215000
46.00 3.216235
47.00 3.221174
48.00 3.226112
The idea is to create the index you want (s.index + idx), which is sorted automatically, reindex on that (which introduces NaNs at the new points), and then interpolate to fill the NaNs using the 'values' method, which interpolates at the index points.
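Note: in current pandas versions, "+" between two numeric indexes is element-wise addition rather than a union, so with a recent version you would build the combined index explicitly; the interpolation step is unchanged:
idx = pd.Index(np.arange(23, 392))
s.reindex(s.index.union(idx)).interpolate(method='values')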
You can call numpy.interp() directly:
import numpy as np
import pandas as pd
import io
data = """x y
22.88 3.047
45.75 3.215
68.63 3.328
91.50 3.423
114.38 3.516
137.25 3.578
163.40 3.676
196.08 3.756
228.76 3.861
261.44 3.942
294.12 4.012
326.80 4.084
359.48 4.147
392.16 4.197"""
s = pd.read_csv(io.StringIO(data), delim_whitespace=True, index_col=0, squeeze=True)
new_idx = np.arange(23,393)
new_val = np.interp(new_idx, s.index.values.astype(float), s.values)
s2 = pd.Series(new_val, new_idx)
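Note that np.interp performs piecewise-linear interpolation only and assumes the x values are increasing. If you need another kind of interpolation, scipy's interp1d is one option (a sketch, assuming scipy is available):
from scipy.interpolate import interp1d
f = interp1d(s.index.values, s.values, kind='cubic')
s3 = pd.Series(f(new_idx), index=new_idx)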
