Pandas : interpolate a dataframe and replace values - python

For each column of a dataframe, I did an interpolation using the pandas function "interpolate", and I'm trying to replace the values of the dataframe with the values of the interpolated curve (like a trend curve in Excel).
I have the following dataframe, named data
0 1
0 0.000 0.002
1 0.001 0.002
2 0.001 0.003
3 0.003 0.004
4 0.003 0.005
5 0.003 0.005
6 0.004 0.006
7 0.005 0.006
8 0.006 0.007
9 0.006 0.007
10 0.007 0.008
11 0.007 0.009
12 0.008 0.010
13 0.008 0.010
14 0.010 0.012
I then did the following code:
for i in range(len(data.columns)):
    data[i].interpolate(method="polynomial", order=2, inplace=True)
I thought that inplace would replace the values, but it doesn't seem to work. Does someone know how to do that?
Thanks and have a good day :)

Try this,
import pandas as pd
import numpy as np
I created a mini text file with some crazy values so you can see how interpolate is working.
File looks like this,
0,1
0.0,.002
0.001,.3
NaN,NaN
4.003,NaN
.004,19
.005,234
NaN,444
1,777
Here is how to import and process your data,
df = pd.read_csv('datafile.txt', header=0)
for column in df:
    df[column].interpolate(method="polynomial", order=2, inplace=True)
print(df.head())
the dataframe now looks like this,
0 1
0 0.000000 0.002000
1 0.001000 0.300000
2 2.943616 -30.768123
3 4.003000 -70.313176
4 0.004000 19.000000
5 0.005000 234.000000
6 0.616931 444.000000
7 1.000000 777.000000
Also, if you mean you want to interpolate between the points in your dataframe, that is something different. That would look something like this:
df1 = df.reindex(df.index.union(np.linspace(.11,.25,8)))
df1.interpolate('index')
the results of that look like,
0 1
0.00 0.00000 0.00200
0.11 0.00011 0.03478
0.13 0.00013 0.04074
0.15 0.00015 0.04670
0.17 0.00017 0.05266
0.19 0.00019 0.05862
0.21 0.00021 0.06458
0.23 0.00023 0.07054
0.25 0.00025 0.07650
1.00 0.00100 0.30000

It does in fact work with scipy.interpolate.UnivariateSpline.
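For example, a minimal sketch (assuming data is the dataframe from the question, and k=2 for a quadratic fit, similar to order=2 above):
import numpy as np
from scipy.interpolate import UnivariateSpline

for col in data.columns:
    y = data[col].values
    x = np.arange(len(y))
    spline = UnivariateSpline(x, y, k=2)  # fit a quadratic smoothing spline to the column
    data[col] = spline(x)                 # replace the column with the fitted curve values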

Related

Python: for loop iterations when adding dataframes

I have a dataframe with different returns looking something like:
0.2 -0.1 0.03 0.01
0.02 0.1 -0.1 -0.2
0.05 0.06 0.07 -0.07
0.03 -0.04 -0.04 -0.03
And I have a separate dataframe with the index returns in only one column:
0.01
0.015
-0.01
-0.02
What I want to do is basically add (+) each row value of the index-return dataframe to every value in the corresponding row of each column of the stock-return dataframe.
The desired outcome looks like:
0.21 -0.09
0.035 0.115
0.04 0.05
0.01 -0.06 etc etc
Normally, in Matlab for example, the for loop would be quite simple, but in Python the indexing is what gets me stuck.
I have tried a simple for loop:
for i, j in df_stock_returns.iterrows():
    df_new = df_stock_returns[i, j] + df_index_reuturns[j]
But that doesn't really work, any help is appreciated!
Assuming you have
In [27]: df
Out[27]:
0 1 2 3
0 0.20 -0.10 0.03 0.01
1 0.02 0.10 -0.10 -0.20
2 0.05 0.06 0.07 -0.07
3 0.03 -0.04 -0.04 -0.03
and
In [28]: dfi
Out[28]:
0
0 0.010
1 0.015
2 -0.010
3 -0.020
you can just write
In [26]: pd.concat([df[c] + dfi[0] for c in df], axis=1)
Out[26]:
0 0 1 2
0 0.210 -0.090 0.040 0.020
1 0.035 0.115 -0.085 -0.185
2 0.040 0.050 0.060 -0.080
3 0.010 -0.060 -0.060 -0.050
In pandas you almost never need to iterate over individual cells. Here I just iterated over the columns, and df[c] + dfi[0] adds the two columns element-wise. Then concat with axis=1 (0=rows, 1=columns) just concatenates everything into one dataframe.
I suppose the most straightforward way will work
for c in a.columns:
    a[c] = a[c] + b
>>> a
0 1 2 3
0 0.210 -0.090 0.040 0.020
1 0.215 -0.085 0.045 0.025
2 0.190 -0.110 0.020 0.000
3 0.180 -0.120 0.010 -0.010
You can simply add the two dataframes as below:
col1=[0.2,0.02]
col2=[-0.1,0.2]
col3=[0.01,0.015]
df1=pd.DataFrame(data=list(zip(col1, col2)),columns=['list1','list2'])
df2=pd.DataFrame({'list3':col3})
output = df1[:] + df2['list3'].values
df1[:] selects all columns, and the values of the reference column df2['list3'] are added to them.
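Note that adding a flat .values array broadcasts one value per column; if the goal is to add the index return row by row to every column (aligned on the row index), DataFrame.add with axis=0 is a safer bet. Using the same df1 and df2 as above, a sketch would be:
output = df1.add(df2['list3'], axis=0)  # align df2['list3'] on the row index and add it to every column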

Conditional mean of a dataframe based on datetime column names

I'm new to Python. I'm looking for a way to generate means for row values based on column names (the column names are dates running from January to December). I want to generate a mean for every 10 days over a period of a year. My dataframe is in the format below (2000 rows):
import pandas as pd
df= pd.DataFrame({'A':[81,80.09,83,85,88],
'B':[21.8,22.04,21.8,21.7,22.06],
'20210113':[0,0.05,0,0,0.433],
'20210122':[0,0.13,0,0,0.128],
'20210125':[0.056,0,0.043,0.062,0.16],
'20210213':[0.9,0.56,0.32,0.8,0],
'20210217':[0.7,0.99,0.008,0.23,0.56],
'20210219':[0.9,0.43,0.76,0.98,0.5]})
Expected Output:
In [2]: df
Out[2]:
A B c(Mean 20210111,..20210119 ) D(Mean of 20210120..20210129)..
0 81 21.8
1 80.09 22.04
2 83 21.8
3 85 21.7
4 88 22.06
One way would be to isolate the date columns from the rest of the DF. Transpose it to be able to use normal grouping operations. Then transpose back and merge to the unaffected portion of the DataFrame.
import pandas as pd
df = pd.DataFrame({'A': [81, 80.09, 83, 85, 88],
'B': [21.8, 22.04, 21.8, 21.7, 22.06],
'20210113A.2': [0, 0.05, 0, 0, 0.433],
'20210122B.1': [0, 0.13, 0, 0, 0.128],
'20210125C.3': [0.056, 0, 0.043, 0.062, 0.16],
'20210213': [0.9, 0.56, 0.32, 0.8, 0],
'20210217': [0.7, 0.99, 0.008, 0.23, 0.56],
'20210219': [0.9, 0.43, 0.76, 0.98, 0.5]})
# Unaffected Columns Go Here
keep_columns = ['A', 'B']
# Get All Affected Columns
new_df = df.loc[:, ~df.columns.isin(keep_columns)]
# Strip Extra Information From Column Names
new_df.columns = new_df.columns.map(lambda c: c[0:8])
# Transpose
new_df = new_df.T
# Convert index to DateTime for easy use
new_df.index = pd.to_datetime(new_df.index, format='%Y%m%d')
# Resample every 10 Days on new DT index (Drop any rows with no values)
new_df = new_df.resample("10D").mean().dropna(how='all')
# Transpose and Merge Back on DF
df = df[keep_columns].merge(new_df.T, left_index=True, right_index=True)
# For Display
print(df.to_string())
Output:
A B 2021-01-13 00:00:00 2021-01-23 00:00:00 2021-02-12 00:00:00
0 81.00 21.80 0.0000 0.056 0.833333
1 80.09 22.04 0.0900 0.000 0.660000
2 83.00 21.80 0.0000 0.043 0.362667
3 85.00 21.70 0.0000 0.062 0.670000
4 88.00 22.06 0.2805 0.160 0.353333
new_df = df.loc[:, ~df.columns.isin(keep_columns)]
new_df
0 1 2 3 4
20210113 0.000 0.05 0.000 0.000 0.433
20210122 0.000 0.13 0.000 0.000 0.128
20210125 0.056 0.00 0.043 0.062 0.160
20210213 0.900 0.56 0.320 0.800 0.000
20210217 0.700 0.99 0.008 0.230 0.560
20210219 0.900 0.43 0.760 0.980 0.500
new_df.index = pd.to_datetime(new_df.index, format='%Y%m%d')
new_df
0 1 2 3 4
2021-01-13 0.000 0.05 0.000 0.000 0.433
2021-01-22 0.000 0.13 0.000 0.000 0.128
2021-01-25 0.056 0.00 0.043 0.062 0.160
2021-02-13 0.900 0.56 0.320 0.800 0.000
2021-02-17 0.700 0.99 0.008 0.230 0.560
2021-02-19 0.900 0.43 0.760 0.980 0.500
new_df = new_df.resample("10D").mean().dropna(how='all')
new_df
0 1 2 3 4
2021-01-13 0.000000 0.09 0.000000 0.000 0.280500
2021-01-23 0.056000 0.00 0.043000 0.062 0.160000
2021-02-12 0.833333 0.66 0.362667 0.670 0.353333
new_df.T
2021-01-13 2021-01-23 2021-02-12
0 0.0000 0.056 0.833333
1 0.0900 0.000 0.660000
2 0.0000 0.043 0.362667
3 0.0000 0.062 0.670000
4 0.2805 0.160 0.353333
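If you would rather have plain date strings than full Timestamps as the merged column names, the resampled columns can be formatted before the merge (a small variation on the code above, using a helper name new_df_t):
new_df_t = new_df.T
new_df_t.columns = new_df_t.columns.strftime('%Y-%m-%d')  # e.g. '2021-01-13' instead of '2021-01-13 00:00:00'
df = df[keep_columns].merge(new_df_t, left_index=True, right_index=True)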

Count values in column and assign to row

I have a dataframe like this:
dT_sampleTime steps
0 0.002 0.001
1 0.004 0.002
2 0.004 0.003
3 0.004 0.004
4 0.003 0.005
5 0.007 0.006
6 0.001 0.007
and I want to count how often the value of steps occurs in the column dT_sampleTime and create a new column absolute frequency.
dT_sampleTime steps absolute frequency
0 0.002 0.001 1
1 0.004 0.002 1
2 0.004 0.003 1
3 0.004 0.004 3
4 0.003 0.005 0
5 0.007 0.006 0
6 0.001 0.007 1
My idea was something like this:
df['absolute frequency'] = df.groupby(df['steps'],df['dT_sampleTime']).count
Map the 'steps' column with the value_counts of the 'dT_sampleTime' column, then fill the missing values with 0.
df['absolute frequency'] = (df['steps'].map(df['dT_sampleTime'].value_counts())
.fillna(0, downcast='infer'))
# dT_sampleTime steps absolute frequency
#0 0.002 0.001 1
#1 0.004 0.002 1
#2 0.004 0.003 1
#3 0.004 0.004 3
#4 0.003 0.005 0
#5 0.007 0.006 0
#6 0.001 0.007 1
When mapping with a Series it uses the index to look up the appropriate value. The value_counts Series is
df['dT_sampleTime'].value_counts()
#0.004 3
#0.007 1
#0.001 1
#0.002 1
#0.003 1
#Name: dT_sampleTime, dtype: int64
so 0.004 in the steps column maps to 3, for instance.
Loop across df
Use the steps value of each row as a filter applied to the dT_sampleTime column
The number of rows in the resultant DataFrame is the absolute frequency of the current steps value within the dt_sampleTime column
Append this value to the current row under the absolute frequency field
for i, row in df.iterrows():
    df.loc[i, 'absolute frequency'] = len(df[df['dT_sampleTime'] == row['steps']])
Resulting df based on the example given in your original question:
dT_sampleTime steps absolute frequency
0 0.002 0.001 1.0
1 0.004 0.002 1.0
2 0.004 0.003 1.0
3 0.004 0.004 3.0
4 0.003 0.005 0.0
5 0.007 0.006 0.0
6 0.001 0.007 1.0
I'm not sure that this is the most efficient way to achieve your ends; however, it works quite well and should be suitable for your purpose. Happy to take feedback on this from anybody who knows better and would be so kind.
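As the output above shows, values assigned row by row come back as floats; if you need integers, you can cast the new column afterwards:
df['absolute frequency'] = df['absolute frequency'].astype(int)  # convert the counts back to integers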

Split time series to find annual maximum from a data frame in Pandas

I have a time series dataset as follows, where the first column is the year, the second the month, the third the day, and the subsequent columns are time series data:
1995 1 1 0.0 1.929 23.015 1.429 0.806 0.177 0.027
1995 1 2 0.000 1.097 12.954 0.000 0.196 0.361 0.233
1995 1 3 0.000 11.391 0.228 0.004 2.134 11.190 0.028
1995 1 4 0.504 0.373 0.197 0.333 5.894 0.003 0.098
1995 1 5 0.027 20.957 0.115 0.208 0.000 0.000 0.104
1995 1 6 0.043 9.952 0.042 2.499 1.406 0.000 0.748
.... . . ..... ..... ..... ..... ..... ..... .....
2000 12 31 50.98 23.23 98.78 34.23 34.54 45.54 34.21
I have read it using:
pd.read_csv('D:/test.csv')
I want to read the data columns and find the annual maximum value of each of them. Any suggestion on how to group or split the data to find annual maximum values for each variable would be helpful.
IIUC, given your sample dataframe df, you can read it with:
df = pd.read_csv('D:/test.csv', header=None)
and then group by column 0 (the year) and take the maximum of each column:
g = df.groupby(0).max()
This returns:
1 2 3 4 5 6 7 8 9
0
1995 1 6 0.504 20.957 23.015 2.499 5.894 11.19 0.748
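With header=None the columns get integer labels, so the month and day columns (1 and 2) are included in that result. If you only want the annual maxima of the data variables, one option (a small follow-up sketch, not shown in the output above) is to drop them:
g = df.groupby(0).max().drop(columns=[1, 2])  # keep only the data columns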

feed empty pandas.dataframe with several files

I would like to fill an empty dataframe by appending several files of the same type and structure. However, I can't see what's wrong here:
def files2df(colnames, ext):
    df = DataFrame(columns=colnames)
    for inf in sorted(glob.glob(ext)):
        dfin = read_csv(inf, sep='\t', skiprows=1)
        print(dfin.head(), '\n')
        df.append(dfin, ignore_index=True)
    return df
The resulting dataframe is empty. Could someone give me a hand?
1.0 16.59 0.597 0.87 1.0.1 3282 100.08
0 0.953 14.52 0.561 0.80 0.99 4355 -
1 1.000 31.59 1.000 0.94 1.00 6322 -
2 1.000 6.09 0.237 0.71 1.00 10568 -
3 1.000 31.29 1.000 0.94 1.00 14363 -
4 1.000 31.59 1.000 0.94 1.00 19797 -
1.0 6.69 0.199 0.74 1.0.1 186 13.16
0 1 0.88 0.020 0.13 0.99 394 -
1 1 0.75 0.017 0.11 0.99 1052 -
2 1 3.34 0.097 0.57 1.00 1178 -
3 1 1.50 0.035 0.26 1.00 1211 -
4 1 20.59 0.940 0.88 1.00 1583 -
1.0 0.12 0.0030 0.04 0.97 2285 2.62
0 1 1.25 0.135 0.18 0.99 2480 -
1 1 0.03 0.001 0.04 0.97 7440 -
2 1 0.12 0.003 0.04 0.97 8199 -
3 1 1.10 0.092 0.16 0.99 11174 -
4 1 0.27 0.007 0.06 0.98 11310 -
0.244 0.07 0.0030 0.02 0.76 41314 1.32
0 0.181 0.64 0.028 0.03 0.36 41755 -
1 0.161 0.18 0.008 0.01 0.45 42420 -
2 0.161 0.18 0.008 0.01 0.45 42461 -
3 0.237 0.25 0.011 0.02 0.56 43060 -
4 0.267 1.03 0.047 0.07 0.46 43321 -
0.163 0.12 0.0060 0.01 0.5 103384 1.27
0 0.243 0.27 0.014 0.02 0.56 104693 -
1 0.215 0.66 0.029 0.04 0.41 105192 -
2 0.190 0.10 0.005 0.01 0.59 105758 -
3 0.161 0.12 0.006 0.01 0.50 109783 -
4 0.144 0.16 0.007 0.01 0.42 110067 -
Empty DataFrame
Columns: array([D, LOD, r2, CIlow, CIhi, Dist, T-int], dtype=object)
Index: array([], dtype=object)
df.append(dfin, ignore_index=True) returns a new DataFrame; it does not change df in place.
Use df = df.append(dfin, ignore_index=True). Even with that change, though, appending inside a loop rebuilds the frame on every iteration, and DataFrame.append (which adds rows, i.e. works along axis=0) has since been deprecated in favor of concatenation.
In this scenario (reading multiple files and using all of the data to create a single DataFrame), I would use pandas.concat(). The code below will give you a frame with columns named by colnames, with the rows formed by the data in the csv files.
def files2df(colnames, ext):
    files = sorted(glob.glob(ext))
    frames = [read_csv(inf, sep='\t', skiprows=1, names=colnames) for inf in files]
    return concat(frames, ignore_index=True)
I did not try this code, I just wrote it here; you may need to tweak it to get it running, but the idea should be clear (I hope).
Also, I found another solution, but I don't know which one is faster.
def files2df(colnames, ext):
    dflist = []
    for inf in sorted(glob.glob(ext)):
        dflist.append(read_csv(inf, names=colnames, sep='\t', skiprows=1))
    #print(dflist)
    df = concat(dflist, axis=0, ignore_index=True)
    #print(df.to_string())
    return df
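Either way, a call would look something like this (the column names are taken from the output shown in the question; the '*.txt' glob pattern is just an assumed example):
colnames = ['D', 'LOD', 'r2', 'CIlow', 'CIhi', 'Dist', 'T-int']
df = files2df(colnames, '*.txt')  # read and stack every matching tab-separated file
print(df.shape)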
