I'm new to Python. I'm looking for a way to compute row-wise means based on column names (the column names are dates in a series from January to December). I want the mean of every 10 days over the period of a year. My dataframe is in the below format (2000 rows):
import pandas as pd
df = pd.DataFrame({'A': [81, 80.09, 83, 85, 88],
                   'B': [21.8, 22.04, 21.8, 21.7, 22.06],
                   '20210113': [0, 0.05, 0, 0, 0.433],
                   '20210122': [0, 0.13, 0, 0, 0.128],
                   '20210125': [0.056, 0, 0.043, 0.062, 0.16],
                   '20210213': [0.9, 0.56, 0.32, 0.8, 0],
                   '20210217': [0.7, 0.99, 0.008, 0.23, 0.56],
                   '20210219': [0.9, 0.43, 0.76, 0.98, 0.5]})
Expected Output:
In [2]: df
Out[2]:
A      B      C (mean of 20210111..20210119)  D (mean of 20210120..20210129)  ...
0 81 21.8
1 80.09 22.04
2 83 21.8
3 85 21.7
4 88 22.06
One way would be to isolate the date columns from the rest of the DataFrame, transpose so that normal grouping operations can be used, then transpose back and merge onto the unaffected portion of the DataFrame.
import pandas as pd
df = pd.DataFrame({'A': [81, 80.09, 83, 85, 88],
                   'B': [21.8, 22.04, 21.8, 21.7, 22.06],
                   '20210113A.2': [0, 0.05, 0, 0, 0.433],
                   '20210122B.1': [0, 0.13, 0, 0, 0.128],
                   '20210125C.3': [0.056, 0, 0.043, 0.062, 0.16],
                   '20210213': [0.9, 0.56, 0.32, 0.8, 0],
                   '20210217': [0.7, 0.99, 0.008, 0.23, 0.56],
                   '20210219': [0.9, 0.43, 0.76, 0.98, 0.5]})
# Unaffected Columns Go Here
keep_columns = ['A', 'B']
# Get All Affected Columns
new_df = df.loc[:, ~df.columns.isin(keep_columns)]
# Strip Extra Information From Column Names
new_df.columns = new_df.columns.map(lambda c: c[0:8])
# Transpose
new_df = new_df.T
# Convert index to DateTime for easy use
new_df.index = pd.to_datetime(new_df.index, format='%Y%m%d')
# Resample every 10 Days on new DT index (Drop any rows with no values)
new_df = new_df.resample("10D").mean().dropna(how='all')
# Transpose and Merge Back on DF
df = df[keep_columns].merge(new_df.T, left_index=True, right_index=True)
# For Display
print(df.to_string())
Output:
A B 2021-01-13 00:00:00 2021-01-23 00:00:00 2021-02-12 00:00:00
0 81.00 21.80 0.0000 0.056 0.833333
1 80.09 22.04 0.0900 0.000 0.660000
2 83.00 21.80 0.0000 0.043 0.362667
3 85.00 21.70 0.0000 0.062 0.670000
4 88.00 22.06 0.2805 0.160 0.353333
Step by step: first isolate the date columns, strip the extra information from the names, and transpose (as in the code above):
new_df = df.loc[:, ~df.columns.isin(keep_columns)]
new_df.columns = new_df.columns.map(lambda c: c[0:8])
new_df = new_df.T
new_df
0 1 2 3 4
20210113 0.000 0.05 0.000 0.000 0.433
20210122 0.000 0.13 0.000 0.000 0.128
20210125 0.056 0.00 0.043 0.062 0.160
20210213 0.900 0.56 0.320 0.800 0.000
20210217 0.700 0.99 0.008 0.230 0.560
20210219 0.900 0.43 0.760 0.980 0.500
new_df.index = pd.to_datetime(new_df.index, format='%Y%m%d')
new_df
0 1 2 3 4
2021-01-13 0.000 0.05 0.000 0.000 0.433
2021-01-22 0.000 0.13 0.000 0.000 0.128
2021-01-25 0.056 0.00 0.043 0.062 0.160
2021-02-13 0.900 0.56 0.320 0.800 0.000
2021-02-17 0.700 0.99 0.008 0.230 0.560
2021-02-19 0.900 0.43 0.760 0.980 0.500
new_df = new_df.resample("10D").mean().dropna(how='all')
new_df
0 1 2 3 4
2021-01-13 0.000000 0.09 0.000000 0.000 0.280500
2021-01-23 0.056000 0.00 0.043000 0.062 0.160000
2021-02-12 0.833333 0.66 0.362667 0.670 0.353333
new_df.T
2021-01-13 2021-01-23 2021-02-12
0 0.0000 0.056 0.833333
1 0.0900 0.000 0.660000
2 0.0000 0.043 0.362667
3 0.0000 0.062 0.670000
4 0.2805 0.160 0.353333
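For comparison, an equivalent sketch (my addition, not part of the original answer) that computes the same 10-day means with groupby and pd.Grouper. It assumes df and keep_columns as defined at the top of the answer, before the merge overwrote df:
import pandas as pd

# Strip the column names to dates, transpose, and bin the DatetimeIndex into
# 10-day buckets (pd.Grouper(freq='10D') behaves like resample("10D") here).
date_part = df.drop(columns=keep_columns)
date_part.columns = pd.to_datetime(date_part.columns.str[:8], format='%Y%m%d')
means = date_part.T.groupby(pd.Grouper(freq='10D')).mean().dropna(how='all').T
result = df[keep_columns].join(means)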
I have this dataframe, data.
data = pd.DataFrame({'group': ['A', 'A', 'B', 'C', 'C', 'B'],
                     'value': [0.2, 0.21, 0.54, 0.02, 0.001, 0.19]})
I want to build three new features. Below is my target output.
pd.DataFrame({'group': ['A', 'A', 'B', 'C', 'C', 'B'],
              'value': [0.2, 0.21, 0.54, 0.02, 0.001, 0.19],
              'group_A': [0.2, 0.21, 0, 0, 0, 0],
              'group_B': [0, 0, 0.54, 0, 0, 0.19],
              'group_C': [0, 0, 0, 0.02, 0.001, 0]})
What is the most efficient way to perform such a task? The code below solves the problem, but perhaps there is a vectorized way to do it, since my real-world data set is very large.
for g in data.group.unique():
    tmp = [i if j == g else 0 for i, j in zip(data.value, data.group)]
    data['group_{}'.format(g)] = tmp
Use DataFrame.join with DataFrame.pivot, DataFrame.add_prefix and DataFrame.fillna:
df = (data.join(data.reset_index()
                    .pivot(index='index', columns='group', values='value')
                    .add_prefix('group_')
                    .fillna(0)))
print(df)
group value group_A group_B group_C
0 A 0.200 0.20 0.00 0.000
1 A 0.210 0.21 0.00 0.000
2 B 0.540 0.00 0.54 0.000
3 C 0.020 0.00 0.00 0.020
4 C 0.001 0.00 0.00 0.001
5 B 0.190 0.00 0.19 0.000
Alternative solution:
df = (data.join(data.set_index('group', append=True)['value']
                    .unstack(fill_value=0)
                    .add_prefix('group_')))
print(df)
group value group_A group_B group_C
0 A 0.200 0.20 0.00 0.000
1 A 0.210 0.21 0.00 0.000
2 B 0.540 0.00 0.54 0.000
3 C 0.020 0.00 0.00 0.020
4 C 0.001 0.00 0.00 0.001
5 B 0.190 0.00 0.19 0.000
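For completeness, one more vectorized sketch (my addition, not from the original answers): one-hot encode the groups with get_dummies and scale each indicator column by value.
import pandas as pd

# Each indicator column is 1 where the row belongs to that group; multiplying
# row-wise by 'value' turns the 1s into that row's value and leaves 0 elsewhere.
dummies = pd.get_dummies(data['group'], prefix='group')
df = data.join(dummies.mul(data['value'], axis=0))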
I'd like to replicate this method of winsorizing a dataframe with specified percentile regions in Python. I tried using the scipy winsorize function, but that didn't give the results I was looking for.
Example expected output for a dataframe winsorized at the 0.01 (low) and 0.99 (high) quantiles across each date:
Original df:
A B C D E
2020-06-30 0.033 -0.182 -0.016 0.665 0.025
2020-07-31 0.142 -0.175 -0.016 0.556 0.024
2020-08-31 0.115 -0.187 -0.017 0.627 0.027
2020-09-30 0.032 -0.096 -0.022 0.572 0.024
Winsorized data:
A B C D E
2020-06-30 0.033 -0.175 -0.016 0.64 0.025
2020-07-31 0.142 -0.169 -0.016 0.54 0.024
2020-08-31 0.115 -0.18 -0.017 0.606 0.027
2020-09-30 0.032 -0.093 -0.022 0.55 0.024
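No answer is quoted here, so here is a minimal sketch of one way to do it (my addition), assuming each date (row) is clipped at its own 0.01 and 0.99 quantiles:
import pandas as pd

# Sketch (my naming): clip each row of the frame to that row's own
# 0.01/0.99 quantile bounds.
def winsorize_rows(df, lower=0.01, upper=0.99):
    lo = df.quantile(lower, axis=1)  # per-row lower bound
    hi = df.quantile(upper, axis=1)  # per-row upper bound
    return df.clip(lower=lo, upper=hi, axis=0)  # align bounds on the row index

winsorized = winsorize_rows(df)
With pandas' default linear quantile interpolation this reproduces the table above: in the first row, -0.182 is clipped up to about -0.175 and 0.665 down to about 0.64.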
I have a dataframe with different returns looking something like:
0.2 -0.1 0.03 0.01
0.02 0.1 -0.1 -0.2
0.05 0.06 0.07 -0.07
0.03 -0.04 -0.04 -0.03
And I have a separate dataframe with the index returns in only one column:
0.01
0.015
-0.01
-0.02
What I want to do is basically add (+) each row's value of the index-return dataframe to every column of the stock-return dataframe, row by row.
The desired outcome looks like:
0.21 -0.09
0.035 0.115
0.04 0.05
0.01 -0.06 etc etc
Normally, in Matlab for example, the for loop would be quite simple, but in Python the indexing is what gets me stuck.
I have tried a simple for loop:
for i, j in df_stock_returns.iterrows():
df_new = df_stock_returns[i, j] + df_index_reuturns[j]
But that doesn't really work. Any help is appreciated!
Assuming you have
In [27]: df
Out[27]:
0 1 2 3
0 0.20 -0.10 0.03 0.01
1 0.02 0.10 -0.10 -0.20
2 0.05 0.06 0.07 -0.07
3 0.03 -0.04 -0.04 -0.03
and
In [28]: dfi
Out[28]:
0
0 0.010
1 0.015
2 -0.010
3 -0.020
you can just write
In [26]: pd.concat([df[c] + dfi[0] for c in df], axis=1)
Out[26]:
0 0 1 2
0 0.210 -0.090 0.040 0.020
1 0.035 0.115 -0.085 -0.185
2 0.040 0.050 0.060 -0.080
3 0.010 -0.060 -0.060 -0.050
In pandas you almost never need to iterate over individual cells. Here I just iterated over the columns, and df[c] + dfi[0] adds the two columns element-wise. Then concat with axis=1 (0=rows, 1=columns) just concatenates everything into one dataframe.
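The same result also comes from a single vectorized call (my addition, not part of the original answer): DataFrame.add with axis=0 broadcasts the index-return Series down the rows of every column, and it keeps the original column labels.
result = df.add(dfi[0], axis=0)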
I suppose the most straightforward way will work, assuming b is the index-return column as a Series (e.g. b = dfi[0]):
for c in a.columns:
    a[c] = a[c] + b
>>> a
       0      1      2      3
0  0.210 -0.090  0.040  0.020
1  0.035  0.115 -0.085 -0.185
2  0.040  0.050  0.060 -0.080
3  0.010 -0.060 -0.060 -0.050
You can simply add the two DataFrames as below:
col1 = [0.2, 0.02]
col2 = [-0.1, 0.2]
col3 = [0.01, 0.015]
df1 = pd.DataFrame(data=list(zip(col1, col2)), columns=['list1', 'list2'])
df2 = pd.DataFrame({'list3': col3})
output = df1.add(df2['list3'], axis=0)
df1.add(..., axis=0) aligns df2['list3'] with df1's row index and adds it to every column. (Plain df1[:] + df2['list3'].values would broadcast the array across the columns rather than down the rows, which is not what is wanted here.)
For each column of a dataframe, I did an interpolation using the pandas interpolate function, and I'm trying to replace the values of the dataframe with the values of the interpolated curve (like a trend curve in Excel).
I have the following dataframe, named data
0 1
0 0.000 0.002
1 0.001 0.002
2 0.001 0.003
3 0.003 0.004
4 0.003 0.005
5 0.003 0.005
6 0.004 0.006
7 0.005 0.006
8 0.006 0.007
9 0.006 0.007
10 0.007 0.008
11 0.007 0.009
12 0.008 0.010
13 0.008 0.010
14 0.010 0.012
I then ran the following code:
for i in range(len(data.columns)):
    data[i].interpolate(method="polynomial", order=2, inplace=True)
I thought that inplace would replace the values, but it doesn't seem to work. Does someone know how to do that?
Thanks, and have a good day :)
Try this:
import pandas as pd
import numpy as np
I created a mini text file with some crazy values (including NaNs, since interpolate only fills missing values) so you can see how interpolate works. The file looks like this:
0,1
0.0,.002
0.001,.3
NaN,NaN
4.003,NaN
.004,19
.005,234
NaN,444
1,777
Here is how to import and process your data,
df = pd.read_csv('datafile.txt', header=0)
for column in df:
    df[column] = df[column].interpolate(method="polynomial", order=2)
print(df)
The dataframe now looks like this:
0 1
0 0.000000 0.002000
1 0.001000 0.300000
2 2.943616 -30.768123
3 4.003000 -70.313176
4 0.004000 19.000000
5 0.005000 234.000000
6 0.616931 444.000000
7 1.000000 777.000000
Also, if you mean you want to interpolate between the points in your dataframe, that is something different. Something like that would be:
df1 = df.reindex(df.index.union(np.linspace(.11,.25,8)))
df1.interpolate('index')
the results of that look like,
0 1
0.00 0.00000 0.00200
0.11 0.00011 0.03478
0.13 0.00013 0.04074
0.15 0.00015 0.04670
0.17 0.00017 0.05266
0.19 0.00019 0.05862
0.21 0.00021 0.06458
0.23 0.00023 0.07054
0.25 0.00025 0.07650
1.00 0.00100 0.30000
In fact, it works with scipy.interpolate.UnivariateSpline.
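A sketch of that approach (my reconstruction, not the original poster's code), fitting a quadratic smoothing spline to each column and overwriting it with the fitted trend values:
import numpy as np
from scipy.interpolate import UnivariateSpline

# Fit a k=2 (quadratic) smoothing spline per column and evaluate it at the
# original x positions, replacing the column with the trend curve.
x = data.index.to_numpy(dtype=float)
for col in data.columns:
    spline = UnivariateSpline(x, data[col].to_numpy(dtype=float), k=2)
    data[col] = spline(x)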
I would like to fill an empty dataframe by appending several files of the same type and structure. However, I can't see what's wrong here:
import glob
from pandas import DataFrame, read_csv

def files2df(colnames, ext):
    df = DataFrame(columns=colnames)
    for inf in sorted(glob.glob(ext)):
        dfin = read_csv(inf, sep='\t', skiprows=1)
        print(dfin.head(), '\n')
        df.append(dfin, ignore_index=True)
    return df
The resulting dataframe is empty. Could someone give me a hand?
1.0 16.59 0.597 0.87 1.0.1 3282 100.08
0 0.953 14.52 0.561 0.80 0.99 4355 -
1 1.000 31.59 1.000 0.94 1.00 6322 -
2 1.000 6.09 0.237 0.71 1.00 10568 -
3 1.000 31.29 1.000 0.94 1.00 14363 -
4 1.000 31.59 1.000 0.94 1.00 19797 -
1.0 6.69 0.199 0.74 1.0.1 186 13.16
0 1 0.88 0.020 0.13 0.99 394 -
1 1 0.75 0.017 0.11 0.99 1052 -
2 1 3.34 0.097 0.57 1.00 1178 -
3 1 1.50 0.035 0.26 1.00 1211 -
4 1 20.59 0.940 0.88 1.00 1583 -
1.0 0.12 0.0030 0.04 0.97 2285 2.62
0 1 1.25 0.135 0.18 0.99 2480 -
1 1 0.03 0.001 0.04 0.97 7440 -
2 1 0.12 0.003 0.04 0.97 8199 -
3 1 1.10 0.092 0.16 0.99 11174 -
4 1 0.27 0.007 0.06 0.98 11310 -
0.244 0.07 0.0030 0.02 0.76 41314 1.32
0 0.181 0.64 0.028 0.03 0.36 41755 -
1 0.161 0.18 0.008 0.01 0.45 42420 -
2 0.161 0.18 0.008 0.01 0.45 42461 -
3 0.237 0.25 0.011 0.02 0.56 43060 -
4 0.267 1.03 0.047 0.07 0.46 43321 -
0.163 0.12 0.0060 0.01 0.5 103384 1.27
0 0.243 0.27 0.014 0.02 0.56 104693 -
1 0.215 0.66 0.029 0.04 0.41 105192 -
2 0.190 0.10 0.005 0.01 0.59 105758 -
3 0.161 0.12 0.006 0.01 0.50 109783 -
4 0.144 0.16 0.007 0.01 0.42 110067 -
Empty DataFrame
Columns: array([D, LOD, r2, CIlow, CIhi, Dist, T-int], dtype=object)
Index: array([], dtype=object)
df.append(dfin, ignore_index=True) returns a new DataFrame; it does not change df in place.
Use df = df.append(dfin, ignore_index=True). Even with this change, appending inside a loop re-copies the accumulated rows on every iteration, and DataFrame.append has since been deprecated (and removed in pandas 2.0).
In this scenario (reading multiple files and using all the data to build a single DataFrame), I would use pandas.concat(). The code below gives you a frame with columns named by colnames, whose rows come from the data in the CSV files.
def files2df(colnames, ext):
    files = sorted(glob.glob(ext))
    frames = [read_csv(inf, sep='\t', skiprows=1, names=colnames) for inf in files]
    return concat(frames, ignore_index=True)
I did not try this code, just wrote it here; you may need to tweak it to get it running, but the idea is clear (I hope).
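For example, with the column names visible in the output above (the file pattern is hypothetical):
colnames = ['D', 'LOD', 'r2', 'CIlow', 'CIhi', 'Dist', 'T-int']
df = files2df(colnames, '*.txt')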
I also found another solution, but I don't know which one is faster:
def files2df(colnames, ext):
    dflist = []
    for inf in sorted(glob.glob(ext)):
        dflist.append(read_csv(inf, names=colnames, sep='\t', skiprows=1))
    #print(dflist)
    df = concat(dflist, axis=0, ignore_index=True)
    #print(df.to_string())
    return df