Python: for loop iterations when adding dataframes


I have a dataframe with different returns looking something like:
0.2 -0.1 0.03 0.01
0.02 0.1 -0.1 -0.2
0.05 0.06 0.07 -0.07
0.03 -0.04 -0.04 -0.03
And I have a separate dataframe with the index returns in only one column:
0.01
0.015
-0.01
-0.02
What I want to do is add each row's value from the index return dataframe to every column of the stock return dataframe in that same row.
The desired outcome looks like:
0.21 -0.09
0.035 0.115
0.04 0.05
0.01 -0.06 etc etc
Normally in Matlab for example the for loop would be quite simple, but in python the indexing is what gets me stuck.
I have tried a simple for loop:
for i, j in df_stock_returns.iterrows():
    df_new = df_stock_returns[i, j] + df_index_reuturns[j]
But that doesn't really work, any help is appreciated!

Assuming you have
In [27]: df
Out[27]:
0 1 2 3
0 0.20 -0.10 0.03 0.01
1 0.02 0.10 -0.10 -0.20
2 0.05 0.06 0.07 -0.07
3 0.03 -0.04 -0.04 -0.03
and
In [28]: dfi
Out[28]:
0
0 0.010
1 0.015
2 -0.010
3 -0.020
you can just write
In [26]: pd.concat([df[c] + dfi[0] for c in df], axis=1)
Out[26]:
0 0 1 2
0 0.210 -0.090 0.040 0.020
1 0.035 0.115 -0.085 -0.185
2 0.040 0.050 0.060 -0.080
3 0.010 -0.060 -0.060 -0.050
In pandas you almost never need to iterate over individual cells. Here I just iterated over the columns, and df[c] + dfi[0] adds the two columns element-wise. Then concat with axis=1 (0=rows, 1=columns) just concatenates everything into one dataframe.
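The same result can also be had without any explicit loop. The following is a minimal sketch, assuming the two frames share the same row labels as in the example above; it uses DataFrame.add with axis=0, which is a different route from the concat approach in this answer.

```python
import pandas as pd

# Stock returns: one column per stock, one row per period (values from the question).
df = pd.DataFrame([[0.20, -0.10, 0.03, 0.01],
                   [0.02, 0.10, -0.10, -0.20],
                   [0.05, 0.06, 0.07, -0.07],
                   [0.03, -0.04, -0.04, -0.03]])
# Index returns: a single column with the same row labels.
dfi = pd.DataFrame({0: [0.01, 0.015, -0.01, -0.02]})

# axis=0 aligns the Series with the row labels, so dfi[0][i] is added
# to every column in row i.
result = df.add(dfi[0], axis=0)
print(result.round(3))
```

Because the operation works on the whole frame at once, the original column labels of df are kept as they are.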

I suppose the most straightforward way will work
for c in a.columns:
    a[c] = a[c] + b
>>> a
0 1 2 3
0 0.210 -0.090 0.040 0.020
1 0.215 -0.085 0.045 0.025
2 0.190 -0.110 0.020 0.000
3 0.180 -0.120 0.010 -0.010

You can simply add two df as below
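import pandas as pd
col1 = [0.2, 0.02]
col2 = [-0.1, 0.2]
col3 = [0.01, 0.015]
df1 = pd.DataFrame(data=list(zip(col1, col2)), columns=['list1', 'list2'])
df2 = pd.DataFrame({'list3': col3})
# axis=0 adds df2['list3'] down the rows, so every column of df1 gets the same per-row value
output = df1.add(df2['list3'], axis=0)
df1.add(..., axis=0) adds the reference column df2['list3'] to every column of df1, row by row. A plain df1 + df2['list3'].values would instead broadcast the array across the columns, which is not what the question asks for.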

Related

Conditional mean of a dataframe based on datetime column names

I'm new to Python. I'm looking for a way to compute the mean of row values based on the column names (the column names are dates, running from January to December). I want to compute a mean for every 10 days over a period of a year. My dataframe is in the format below (2000 rows):
import pandas as pd
df= pd.DataFrame({'A':[81,80.09,83,85,88],
'B':[21.8,22.04,21.8,21.7,22.06],
'20210113':[0,0.05,0,0,0.433],
'20210122':[0,0.13,0,0,0.128],
'20210125':[0.056,0,0.043,0.062,0.16],
'20210213':[0.9,0.56,0.32,0.8,0],
'20210217':[0.7,0.99,0.008,0.23,0.56],
'20210219':[0.9,0.43,0.76,0.98,0.5]})
Expected Output:
In [2]: df
Out[2]:
A B c(Mean 20210111,..20210119 ) D(Mean of 20210120..20210129)..
0 81 21.8
1 80.09 22.04
2 83 21.8
3 85 21.7
4 88 22.06

One way would be to isolate the date columns from the rest of the DF. Transpose it to be able to use normal grouping operations. Then transpose back and merge to the unaffected portion of the DataFrame.
import pandas as pd
df = pd.DataFrame({'A': [81, 80.09, 83, 85, 88],
'B': [21.8, 22.04, 21.8, 21.7, 22.06],
'20210113A.2': [0, 0.05, 0, 0, 0.433],
'20210122B.1': [0, 0.13, 0, 0, 0.128],
'20210125C.3': [0.056, 0, 0.043, 0.062, 0.16],
'20210213': [0.9, 0.56, 0.32, 0.8, 0],
'20210217': [0.7, 0.99, 0.008, 0.23, 0.56],
'20210219': [0.9, 0.43, 0.76, 0.98, 0.5]})
# Unaffected Columns Go Here
keep_columns = ['A', 'B']
# Get All Affected Columns
new_df = df.loc[:, ~df.columns.isin(keep_columns)]
# Strip Extra Information From Column Names
new_df.columns = new_df.columns.map(lambda c: c[0:8])
# Transpose
new_df = new_df.T
# Convert index to DateTime for easy use
new_df.index = pd.to_datetime(new_df.index, format='%Y%m%d')
# Resample every 10 Days on new DT index (Drop any rows with no values)
new_df = new_df.resample("10D").mean().dropna(how='all')
# Transpose and Merge Back on DF
df = df[keep_columns].merge(new_df.T, left_index=True, right_index=True)
# For Display
print(df.to_string())
Output:
A B 2021-01-13 00:00:00 2021-01-23 00:00:00 2021-02-12 00:00:00
0 81.00 21.80 0.0000 0.056 0.833333
1 80.09 22.04 0.0900 0.000 0.660000
2 83.00 21.80 0.0000 0.043 0.362667
3 85.00 21.70 0.0000 0.062 0.670000
4 88.00 22.06 0.2805 0.160 0.353333
new_df = df.loc[:, ~df.columns.isin(keep_columns)]
new_df
0 1 2 3 4
20210113 0.000 0.05 0.000 0.000 0.433
20210122 0.000 0.13 0.000 0.000 0.128
20210125 0.056 0.00 0.043 0.062 0.160
20210213 0.900 0.56 0.320 0.800 0.000
20210217 0.700 0.99 0.008 0.230 0.560
20210219 0.900 0.43 0.760 0.980 0.500
new_df.index = pd.to_datetime(new_df.index, format='%Y%m%d')
new_df
0 1 2 3 4
2021-01-13 0.000 0.05 0.000 0.000 0.433
2021-01-22 0.000 0.13 0.000 0.000 0.128
2021-01-25 0.056 0.00 0.043 0.062 0.160
2021-02-13 0.900 0.56 0.320 0.800 0.000
2021-02-17 0.700 0.99 0.008 0.230 0.560
2021-02-19 0.900 0.43 0.760 0.980 0.500
new_df = new_df.resample("10D").mean().dropna(how='all')
new_df
0 1 2 3 4
2021-01-13 0.000000 0.09 0.000000 0.000 0.280500
2021-01-23 0.056000 0.00 0.043000 0.062 0.160000
2021-02-12 0.833333 0.66 0.362667 0.670 0.353333
new_df.T
2021-01-13 2021-01-23 2021-02-12
0 0.0000 0.056 0.833333
1 0.0900 0.000 0.660000
2 0.0000 0.043 0.362667
3 0.0000 0.062 0.670000
4 0.2805 0.160 0.353333

Pandas : interpolate a dataframe and replace values

For each column of a dataframe, I did an interpolation using the pandas function "interpolate", and I'm trying to replace the values of the dataframe with the values of the interpolated curve (like a trend curve in Excel).
I have the following dataframe, named data
0 1
0 0.000 0.002
1 0.001 0.002
2 0.001 0.003
3 0.003 0.004
4 0.003 0.005
5 0.003 0.005
6 0.004 0.006
7 0.005 0.006
8 0.006 0.007
9 0.006 0.007
10 0.007 0.008
11 0.007 0.009
12 0.008 0.010
13 0.008 0.010
14 0.010 0.012
I then did the following code:
for i in range(len(data.columns)):
    data[i].interpolate(method="polynomial",order=2,inplace=True)
I thought that inplace would replace the values, but it doesn't seem to work. Does someone know how to do that?
Thanks and have a good day :)

Try this,
import pandas as pd
import numpy as np
I created a mini text file with some crazy values so you can see how interpolate is working.
File looks like this,
0,1
0.0,.002
0.001,.3
NaN,NaN
4.003,NaN
.004,19
.005,234
NaN,444
1,777
Here is how to import and process your data,
df = pd.read_csv('datafile.txt', header=0)
for column in df:
    # assign the result back instead of relying on inplace=True on a column selection
    df[column] = df[column].interpolate(method="polynomial", order=2)
print(df.head())
the dataframe now looks like this,
0 1
0 0.000000 0.002000
1 0.001000 0.300000
2 2.943616 -30.768123
3 4.003000 -70.313176
4 0.004000 19.000000
5 0.005000 234.000000
6 0.616931 444.000000
7 1.000000 777.000000
Also,
if you mean you want to interpolate between the points in your dataframe, that is something different.
Something like that would be,
df1 = df.reindex(df.index.union(np.linspace(.11,.25,8)))
df1.interpolate('index')
the results of that look like,
0 1
0.00 0.00000 0.00200
0.11 0.00011 0.03478
0.13 0.00013 0.04074
0.15 0.00015 0.04670
0.17 0.00017 0.05266
0.19 0.00019 0.05862
0.21 0.00021 0.06458
0.23 0.00023 0.07054
0.25 0.00025 0.07650
1.00 0.00100 0.30000

It's in fact working with scipy.interpolate.UnivariateSpline
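For context, interpolate only fills missing (NaN) values, and the data in the question has none, which would explain why nothing appeared to change. If the goal is an Excel-style polynomial trend curve, one possible sketch is shown below. It uses numpy.polyfit, a least-squares polynomial fit that is swapped in here for illustration and is not the UnivariateSpline approach mentioned above. The name trend and the choice of degree 2 are assumptions.

```python
import numpy as np
import pandas as pd

# First rows of the question's data; there are no NaN values to fill.
data = pd.DataFrame({0: [0.000, 0.001, 0.001, 0.003, 0.003, 0.003],
                     1: [0.002, 0.002, 0.003, 0.004, 0.005, 0.005]})

x = np.arange(len(data))  # use the row position as the x-axis
trend = data.copy()
for col in data.columns:
    # Fit a degree-2 polynomial to the column, then evaluate it at every row.
    coeffs = np.polyfit(x, data[col].to_numpy(), deg=2)
    trend[col] = np.polyval(coeffs, x)
print(trend)
```

Here trend holds the fitted curve values for each column, while data is left untouched.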

Pandas keep highest value in every n consecutive rows

I have a pandas dataframe called df_initial with two columns 'a' and 'b' and N rows.
I would like to halve the number of rows by deleting, from each pair of consecutive rows, the row where the value of 'b' is lower.
Thus between row 0 and row 1 I will keep row 1, between row 2 and row 3 I will keep row 3, etc.
This is the result that I would like to obtain:
print(df_initial)
a b
0 0.04 0.01
1 0.05 0.22
2 0.06 0.34
3 0.07 0.49
4 0.08 0.71
5 0.09 0.09
6 0.10 0.98
7 0.11 0.42
8 0.12 1.32
9 0.13 0.39
10 0.14 0.97
11 0.15 0.05
12 0.16 0.36
13 0.17 1.72
....
print(df_reduced)
a b
0 0.05 0.22
1 0.07 0.49
2 0.08 0.71
3 0.10 0.98
4 0.12 1.32
5 0.14 0.97
6 0.17 1.72
....
Is there a pandas function to do this?
I saw that there is a resample function, DataFrame.resample(), but it only works with a DatetimeIndex, TimedeltaIndex or PeriodIndex, so it does not apply in this case.
Thanks to anyone who can help.

You can groupby every two rows (a simple way of doing so is taking the floor division of the index) and take the idxmax of column b to index the dataframe:
df.loc[df.groupby(df.index//2).b.idxmax(), :]
       a     b
1   0.05  0.22
3   0.07  0.49
4   0.08  0.71
6   0.10  0.98
8   0.12  1.32
10  0.14  0.97
13  0.17  1.72
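Or using Series.rolling: the rolling maximum taken at every second row is the maximum of each pair, and comparing b against it selects the matching rows:
df[df.b.eq(df.b.rolling(2).max()[1::2].reindex(df.index).bfill())]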
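As a self-contained illustration of the groupby and idxmax idea, here is a minimal sketch on the first six rows of the question's data. The variable n (the group size) and the final reset_index call are assumptions added so that the result is numbered from 0, as in the desired output.

```python
import pandas as pd

df_initial = pd.DataFrame({
    "a": [0.04, 0.05, 0.06, 0.07, 0.08, 0.09],
    "b": [0.01, 0.22, 0.34, 0.49, 0.71, 0.09],
})

n = 2  # number of consecutive rows per group
groups = df_initial.index // n  # 0, 0, 1, 1, 2, 2
# idxmax returns, for each group, the row label where 'b' is largest.
keep = df_initial.groupby(groups)["b"].idxmax()
df_reduced = df_initial.loc[keep].reset_index(drop=True)
print(df_reduced)
```

Changing n to another value keeps the row with the highest 'b' in every n consecutive rows, assuming the index is the default 0..N-1 range.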

This is a simple example that drops every row where 'b' is less than 'a'; you can adapt it to your own data.
import numpy as np
import pandas as pd
ar = np.array([[1.1, 1.0], [3.3, 0.2], [2.7, 10], [5.4, 7], [5.3, 9], [1.5, 15]])
df = pd.DataFrame(ar, columns=['a', 'b'])
for i in range(len(df)):
    # drop keeps the original labels, so each label i is still valid when it is visited
    if df['b'][i] < df['a'][i]:
        df = df.drop(index=i)
print(df)

Row based chart plot (Seaborn or Matplotlib)

Given that my data is a pandas dataframe and looks like this:
Ref +1 +2 +3 +4 +5 +6 +7
2013-05-28 1 -0.44 0.03 0.06 -0.31 0.13 0.56 0.81
2013-07-05 2 0.84 1.03 0.96 0.90 1.09 0.59 1.15
2013-08-21 3 0.09 0.25 0.06 0.09 -0.09 -0.16 0.56
2014-10-15 4 0.35 1.16 1.91 3.44 2.75 1.97 2.16
2015-02-09 5 0.09 -0.10 -0.38 -0.69 -0.25 -0.85 -0.47
How can I plot a chart of the 5 lines (one for each Ref), where the X axis is the columns (+1, +2, ...) and every line starts from 0? A seaborn solution would be even better, but matplotlib solutions are also welcome.

Plotting a dataframe in pandas is generally all about reshaping the table so that the individual lines you want are in separate columns, and the x-values are in the index. Some of these reshape operations are a bit ugly, but you can do:
df = pd.read_clipboard()
plot_table = pd.melt(df.reset_index(), id_vars=['index', 'Ref'])
plot_table = plot_table.pivot(index='variable', columns='Ref', values='value')
# Add extra row to have all lines start from 0:
plot_table.loc['+0', :] = 0
plot_table = plot_table.sort_index()
plot_table
Ref 1 2 3 4 5
variable
+0 0.00 0.00 0.00 0.00 0.00
+1 -0.44 0.84 0.09 0.35 0.09
+2 0.03 1.03 0.25 1.16 -0.10
+3 0.06 0.96 0.06 1.91 -0.38
+4 -0.31 0.90 0.09 3.44 -0.69
+5 0.13 1.09 -0.09 2.75 -0.25
+6 0.56 0.59 -0.16 1.97 -0.85
+7 0.81 1.15 0.56 2.16 -0.47
Now that you have a table with the right shape, plotting is pretty automatic:
plot_table.plot()
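If the data is not on the clipboard, the same table shape can be reached with a transpose. This is a minimal sketch on a subset of the question's data (three refs, three columns), assuming 'Ref' is an ordinary column and the dates are the index; it is an alternative to the melt and pivot steps above, not a replacement for them.

```python
import pandas as pd

df = pd.DataFrame(
    {"Ref": [1, 2, 3],
     "+1": [-0.44, 0.84, 0.09],
     "+2": [0.03, 1.03, 0.25],
     "+3": [0.06, 0.96, 0.06]},
    index=pd.to_datetime(["2013-05-28", "2013-07-05", "2013-08-21"]),
)

# One column per Ref, one row per offset (+1, +2, ...).
plot_table = df.set_index("Ref").T
# Extra row so that every line starts from 0.
plot_table.loc["+0"] = 0.0
plot_table = plot_table.sort_index()
print(plot_table)
# plot_table.plot() would then draw one line per Ref (requires matplotlib).
```

Note that sort_index sorts the labels as strings, so this ordering only holds while the offsets are single digits.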

feed empty pandas.dataframe with several files

I would like to fill an empty dataframe by appending several files of the same type and structure. However, I can't see what's wrong here:
def files2df(colnames, ext):
    df = DataFrame(columns = colnames)
    for inf in sorted(glob.glob(ext)):
        dfin = read_csv(inf, sep='\t', skiprows=1)
        print(dfin.head(), '\n')
        df.append(dfin, ignore_index=True)
    return df
The resulting dataframe is empty. Could someone give me a hand?
1.0 16.59 0.597 0.87 1.0.1 3282 100.08
0 0.953 14.52 0.561 0.80 0.99 4355 -
1 1.000 31.59 1.000 0.94 1.00 6322 -
2 1.000 6.09 0.237 0.71 1.00 10568 -
3 1.000 31.29 1.000 0.94 1.00 14363 -
4 1.000 31.59 1.000 0.94 1.00 19797 -
1.0 6.69 0.199 0.74 1.0.1 186 13.16
0 1 0.88 0.020 0.13 0.99 394 -
1 1 0.75 0.017 0.11 0.99 1052 -
2 1 3.34 0.097 0.57 1.00 1178 -
3 1 1.50 0.035 0.26 1.00 1211 -
4 1 20.59 0.940 0.88 1.00 1583 -
1.0 0.12 0.0030 0.04 0.97 2285 2.62
0 1 1.25 0.135 0.18 0.99 2480 -
1 1 0.03 0.001 0.04 0.97 7440 -
2 1 0.12 0.003 0.04 0.97 8199 -
3 1 1.10 0.092 0.16 0.99 11174 -
4 1 0.27 0.007 0.06 0.98 11310 -
0.244 0.07 0.0030 0.02 0.76 41314 1.32
0 0.181 0.64 0.028 0.03 0.36 41755 -
1 0.161 0.18 0.008 0.01 0.45 42420 -
2 0.161 0.18 0.008 0.01 0.45 42461 -
3 0.237 0.25 0.011 0.02 0.56 43060 -
4 0.267 1.03 0.047 0.07 0.46 43321 -
0.163 0.12 0.0060 0.01 0.5 103384 1.27
0 0.243 0.27 0.014 0.02 0.56 104693 -
1 0.215 0.66 0.029 0.04 0.41 105192 -
2 0.190 0.10 0.005 0.01 0.59 105758 -
3 0.161 0.12 0.006 0.01 0.50 109783 -
4 0.144 0.16 0.007 0.01 0.42 110067 -
Empty DataFrame
Columns: array([D, LOD, r2, CIlow, CIhi, Dist, T-int], dtype=object)
Index: array([], dtype=object)

df.append(dfin, ignore_index=True) returns a new DataFrame; it does not change df in place.
Use df = df.append(dfin, ignore_index=True) to keep the result (note that DataFrame.append was removed in pandas 2.0). Even with this change, I think this will not give what you need: the files are read without names=colnames, so the first data row of each file becomes its header, and the appended columns do not line up with colnames.
In this scenario (reading multiple files and using all the data to create a single DataFrame), I would use pandas.concat(). The code below will give you a frame with columns named by colnames, and the rows are formed by the data in the csv files.
import glob
from pandas import concat, read_csv

def files2df(colnames, ext):
    files = sorted(glob.glob(ext))
    # names=colnames gives every file the same column labels, so the rows stack cleanly
    frames = [read_csv(inf, sep='\t', skiprows=1, names=colnames) for inf in files]
    return concat(frames, ignore_index=True)
I did not test this code, so it may need some tweaking to run, but I hope the idea is clear.
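To try the concat idea without real files, here is a minimal sketch that uses in-memory text buffers as stand-ins. The three column names, the sample values and the single header line to skip are assumptions made for illustration.

```python
import io
import pandas as pd

colnames = ["D", "LOD", "r2"]  # shortened, hypothetical column list

# In-memory stand-ins for two tab-separated files, each with one line to skip.
files = [
    io.StringIO("h1\th2\th3\n1.0\t16.59\t0.597\n0.953\t14.52\t0.561\n"),
    io.StringIO("h1\th2\th3\n1.0\t6.69\t0.199\n"),
]

# Every frame gets the same column labels, so concat stacks them row-wise.
frames = [pd.read_csv(f, sep="\t", skiprows=1, names=colnames) for f in files]
df = pd.concat(frames, ignore_index=True)
print(df)
```

Building the list of frames first and calling concat once avoids copying the accumulated data on every loop iteration.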

Also, I found another solution, but don't know which one is faster.
import glob
from pandas import concat, read_csv

def files2df(colnames, ext):
    dflist = []
    for inf in sorted(glob.glob(ext)):
        dflist.append(read_csv(inf, names=colnames, sep='\t', skiprows=1))
    #print(dflist)
    df = concat(dflist, axis=0, ignore_index=True)
    #print(df.to_string())
    return df
