Feed an empty pandas DataFrame with several files - python

I would like to fill an empty dataframe by appending several files of the same type and structure. However, I can't see what's wrong here:
def files2df(colnames, ext):
    df = DataFrame(columns=colnames)
    for inf in sorted(glob.glob(ext)):
        dfin = read_csv(inf, sep='\t', skiprows=1)
        print(dfin.head(), '\n')
        df.append(dfin, ignore_index=True)
    return df
The resulting dataframe is empty. Could someone give me a hand?
1.0 16.59 0.597 0.87 1.0.1 3282 100.08
0 0.953 14.52 0.561 0.80 0.99 4355 -
1 1.000 31.59 1.000 0.94 1.00 6322 -
2 1.000 6.09 0.237 0.71 1.00 10568 -
3 1.000 31.29 1.000 0.94 1.00 14363 -
4 1.000 31.59 1.000 0.94 1.00 19797 -
1.0 6.69 0.199 0.74 1.0.1 186 13.16
0 1 0.88 0.020 0.13 0.99 394 -
1 1 0.75 0.017 0.11 0.99 1052 -
2 1 3.34 0.097 0.57 1.00 1178 -
3 1 1.50 0.035 0.26 1.00 1211 -
4 1 20.59 0.940 0.88 1.00 1583 -
1.0 0.12 0.0030 0.04 0.97 2285 2.62
0 1 1.25 0.135 0.18 0.99 2480 -
1 1 0.03 0.001 0.04 0.97 7440 -
2 1 0.12 0.003 0.04 0.97 8199 -
3 1 1.10 0.092 0.16 0.99 11174 -
4 1 0.27 0.007 0.06 0.98 11310 -
0.244 0.07 0.0030 0.02 0.76 41314 1.32
0 0.181 0.64 0.028 0.03 0.36 41755 -
1 0.161 0.18 0.008 0.01 0.45 42420 -
2 0.161 0.18 0.008 0.01 0.45 42461 -
3 0.237 0.25 0.011 0.02 0.56 43060 -
4 0.267 1.03 0.047 0.07 0.46 43321 -
0.163 0.12 0.0060 0.01 0.5 103384 1.27
0 0.243 0.27 0.014 0.02 0.56 104693 -
1 0.215 0.66 0.029 0.04 0.41 105192 -
2 0.190 0.10 0.005 0.01 0.59 105758 -
3 0.161 0.12 0.006 0.01 0.50 109783 -
4 0.144 0.16 0.007 0.01 0.42 110067 -
Empty DataFrame
Columns: array([D, LOD, r2, CIlow, CIhi, Dist, T-int], dtype=object)
Index: array([], dtype=object)

df.append(dfin, ignore_index=True) returns a new DataFrame; it does not change df in place.
Writing df = df.append(dfin, ignore_index=True) would fix that, but appending inside a loop re-copies the accumulated data on every iteration, and DataFrame.append was removed entirely in pandas 2.0.
In this scenario (reading multiple files and using all the data to create a single DataFrame), I would use pandas.concat(). The code below gives you a frame with columns named by colnames, whose rows are formed by the data in the csv files:
import glob
from pandas import concat, read_csv

def files2df(colnames, ext):
    files = sorted(glob.glob(ext))
    frames = [read_csv(inf, sep='\t', skiprows=1, names=colnames) for inf in files]
    return concat(frames, ignore_index=True)
I did not run this code, so you may need to tweak it to get it working, but the idea should be clear (I hope).

Also, I found another solution, but I don't know which one is faster:
def files2df(colnames, ext):
    dflist = []
    for inf in sorted(glob.glob(ext)):
        dflist.append(read_csv(inf, names=colnames, sep='\t', skiprows=1))
    df = concat(dflist, axis=0, ignore_index=True)
    return df
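To make the concat approach concrete, here is a self-contained sketch that writes two tiny tab-separated files to a temporary directory and combines them (the file names and numbers are made up for illustration):

```python
import glob
import os
import tempfile

import pandas as pd

# Write two tiny tab-separated files, each with a header line plus one data row.
tmp = tempfile.mkdtemp()
for name, text in [("a.txt", "D\tLOD\n1.0\t16.59\n"),
                   ("b.txt", "D\tLOD\n0.9\t6.69\n")]:
    with open(os.path.join(tmp, name), "w") as fh:
        fh.write(text)

def files2df(colnames, pattern):
    files = sorted(glob.glob(pattern))
    # skiprows=1 skips each file's header; names= supplies the column names.
    frames = [pd.read_csv(f, sep="\t", skiprows=1, names=colnames)
              for f in files]
    return pd.concat(frames, ignore_index=True)

df = files2df(["D", "LOD"], os.path.join(tmp, "*.txt"))
# df now has one row per file, with the supplied column names
```

Because concat builds the result once instead of re-copying it on every iteration, this also scales better than repeated appends.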

Related

Matching value to column index pandas

I have dataframe that looks like the following:
Given Y1 eY1 Y2 eY2 Y3 eY3 Y4 eY4 Y5
0 0.45 0.25 0.3550 0.39 0.4200 0.43 0.5950 0.65 0.7175 0.74
1 0.39 0.15 0.2400 0.27 0.5025 0.58 0.7675 0.83 0.8600 0.87
2 0.99 0.30 0.4875 0.55 0.7225 0.78 0.9075 0.95 0.9800 0.99
3 0.58 0.23 0.2825 0.30 0.5550 0.64 0.7075 0.73 0.8725 0.92
4 NaN 0.25 0.3625 0.40 0.6175 0.69 0.8100 0.85 0.9250 0.95
My goal is simple: for each row, match the "Given" value to the column whose value is closest (the columns are sorted in ascending order), and write that column's name to a new column. I have been stuck on this for some time and would greatly appreciate any help or starting tips.
(For any NaN values in Given, I output None.)
Thank you!
First subtract the Given column from all the other columns with DataFrame.sub, take absolute values, and then use DataFrame.idxmin along the columns; rows where Given is missing are masked afterwards with np.where:
df1 = df.drop(columns='Given').sub(df['Given'], axis=0).abs()
print (df1)
Y1 eY1 Y2 eY2 Y3 eY3 Y4 eY4 Y5
0 0.20 0.0950 0.06 0.0300 0.02 0.1450 0.20 0.2675 0.29
1 0.24 0.1500 0.12 0.1125 0.19 0.3775 0.44 0.4700 0.48
2 0.69 0.5025 0.44 0.2675 0.21 0.0825 0.04 0.0100 0.00
3 0.35 0.2975 0.28 0.0250 0.06 0.1275 0.15 0.2925 0.34
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN
df['new'] = np.where(df['Given'].isna(), None, df1.idxmin(axis=1))
print (df)
Given Y1 eY1 Y2 eY2 Y3 eY3 Y4 eY4 Y5 new
0 0.45 0.25 0.3550 0.39 0.4200 0.43 0.5950 0.65 0.7175 0.74 Y3
1 0.39 0.15 0.2400 0.27 0.5025 0.58 0.7675 0.83 0.8600 0.87 eY2
2 0.99 0.30 0.4875 0.55 0.7225 0.78 0.9075 0.95 0.9800 0.99 Y5
3 0.58 0.23 0.2825 0.30 0.5550 0.64 0.7075 0.73 0.8725 0.92 eY2
4 NaN 0.25 0.3625 0.40 0.6175 0.69 0.8100 0.85 0.9250 0.95 None
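A self-contained sketch of the same sub/idxmin idea on a tiny frame (the values here are made up for illustration, and the NaN case is left out):

```python
import pandas as pd

# Small frame in the same shape as the question: a Given column plus candidates.
df = pd.DataFrame({"Given": [0.45, 0.39],
                   "Y1": [0.25, 0.35],
                   "Y2": [0.42, 0.50]})

# Absolute distance of every candidate column from Given, row by row.
diff = df.drop(columns="Given").sub(df["Given"], axis=0).abs()

# idxmin(axis=1) returns, per row, the name of the column with the smallest distance.
df["new"] = diff.idxmin(axis=1)
```

Row 0 picks Y2 (|0.42 - 0.45| = 0.03 beats 0.20) and row 1 picks Y1 (0.04 beats 0.11).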

How to add rows up to point?

I have a pandas Series. I would like to add up rows from the beginning until the total equals a value.
For example, I would like to add the first few values so that they sum to 0.19798863694528301.
Then keep adding from the beginning so that the second total equals 3.79478220811793, and so forth.
That data is given below:
0.0
0.015
0.03
0.045
0.06
0.075
0.09
0.105
0.12
0.135
0.15
0.165
0.18
0.195
0.21
0.225
0.24
0.255
0.27
0.285
0.3
0.315
0.33
0.345
0.36
0.375
0.39
0.405
0.42
0.435
0.45
0.465
0.48
0.495
0.51
0.525
0.54
0.555
0.57
0.585
0.6
0.615
0.63
0.645
0.66
0.675
0.69
0.705
0.72
0.735
0.75
0.765
0.78
0.795
0.81
0.825
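No answer is attached to this question and the wording is ambiguous; if the goal is to find how many leading rows must be summed to reach each target, one possible sketch is a cumulative sum plus numpy.searchsorted (the targets are copied from the question, the series is the data above):

```python
import numpy as np
import pandas as pd

# The series from the question: 0.0, 0.015, 0.03, ..., 0.825 (56 values).
s = pd.Series(np.arange(56) * 0.015)

targets = [0.19798863694528301, 3.79478220811793]

csum = s.cumsum()
# For each target, the first position at which the running total reaches it.
positions = np.searchsorted(csum.to_numpy(), targets)
```

Because the cumulative sum is monotonically increasing here, searchsorted finds each threshold in O(log n) per target.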

Python: for loop iterations when adding dataframes

I have a dataframe with different returns looking something like:
0.2 -0.1 0.03 0.01
0.02 0.1 -0.1 -0.2
0.05 0.06 0.07 -0.07
0.03 -0.04 -0.04 -0.03
And I have a separate dataframe with the index returns in only one column:
0.01
0.015
-0.01
-0.02
What I want to do is basically add (+) each row value of the index-return dataframe to every value in the corresponding row of the stock-return dataframe.
The desired outcome looks like:
0.21 -0.09
0.035 0.115
0.04 0.05
0.01 -0.06 etc etc
Normally in Matlab for example the for loop would be quite simple, but in python the indexing is what gets me stuck.
I have tried a simple for loop:
for i, j in df_stock_returns.iterrows():
    df_new = df_stock_returns[i, j] + df_index_returns[j]
But that doesn't really work, any help is appreciated!
Assuming you have
In [27]: df
Out[27]:
0 1 2 3
0 0.20 -0.10 0.03 0.01
1 0.02 0.10 -0.10 -0.20
2 0.05 0.06 0.07 -0.07
3 0.03 -0.04 -0.04 -0.03
and
In [28]: dfi
Out[28]:
0
0 0.010
1 0.015
2 -0.010
3 -0.020
you can just write
In [26]: pd.concat([df[c] + dfi[0] for c in df], axis=1)
Out[26]:
0 0 1 2
0 0.210 -0.090 0.040 0.020
1 0.035 0.115 -0.085 -0.185
2 0.040 0.050 0.060 -0.080
3 0.010 -0.060 -0.060 -0.050
In pandas you almost never need to iterate over individual cells. Here I just iterated over the columns, and df[c] + dfi[0] adds the two columns element-wise. Then concat with axis=1 (0=rows, 1=columns) just concatenates everything into one dataframe.
I suppose the most straightforward way will also work (assuming b is the index-return Series, so a[c] + b aligns on the row index):
for c in a.columns:
    a[c] = a[c] + b
>>> a
       0      1      2      3
0  0.210 -0.090  0.040  0.020
1  0.035  0.115 -0.085 -0.185
2  0.040  0.050  0.060 -0.080
3  0.010 -0.060 -0.060 -0.050
You can simply add the two frames as below:
col1 = [0.2, 0.02]
col2 = [-0.1, 0.2]
col3 = [0.01, 0.015]
df1 = pd.DataFrame(data=list(zip(col1, col2)), columns=['list1', 'list2'])
df2 = pd.DataFrame({'list3': col3})
output = df1.add(df2['list3'].values, axis=0)
The add with axis=0 aligns the values of df2['list3'] with the rows of df1, so each row gets the corresponding index return added. (A plain df1 + df2['list3'].values would instead broadcast the array across the columns, which is not what you want here.)
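Pulling the answers together, the shortest route for the original question is probably DataFrame.add with axis=0, which aligns the index returns with the rows in a single call (tiny made-up frames for illustration):

```python
import pandas as pd

# Stock returns: one row per period, one column per stock.
stock = pd.DataFrame([[0.20, -0.10, 0.03],
                      [0.02, 0.10, -0.10]])

# Index returns: one value per period.
index_ret = pd.Series([0.01, 0.015])

# axis=0 broadcasts index_ret down the rows: row i gets index_ret[i] added.
out = stock.add(index_ret, axis=0)
```

No loop is needed; the alignment that made the iterrows attempt awkward is handled by pandas.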

Is there a way to interpolate values while maintaining a ratio?

I have a dataframe of percentages, and I want to interpolate the intermediate values
      0     5    10    15    20    25    30    35    40
A 0.50 0.50 0.50 0.49 0.47 0.41 0.35 0.29 0.22
B 0.31 0.31 0.31 0.29 0.28 0.24 0.22 0.18 0.13
C 0.09 0.09 0.09 0.09 0.08 0.07 0.06 0.05 0.04
D 0.08 0.08 0.08 0.08 0.06 0.06 0.05 0.04 0.03
E 0.01 0.01 0.01 0.01 0.01 0.02 0.02 0.03 0.04
F 0.01 0.01 0.01 0.04 0.10 0.20 0.30 0.41 0.54
So far, I've been using scipy's interp1d and iterating row by row, but it doesn't always maintain the ratios perfectly down the column. Is there a way to do this all together in one function?
reindex then interpolate
r = range(df.columns.min(), df.columns.max() + 1)
df.reindex(columns=r).interpolate(axis=1)
0 1 2 3 4 5 6 7 8 9 ... 31 32 33 34 35 36 37 38 39 40
A 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 ... 0.338 0.326 0.314 0.302 0.29 0.276 0.262 0.248 0.234 0.22
B 0.31 0.31 0.31 0.31 0.31 0.31 0.31 0.31 0.31 0.31 ... 0.212 0.204 0.196 0.188 0.18 0.170 0.160 0.150 0.140 0.13
C 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 ... 0.058 0.056 0.054 0.052 0.05 0.048 0.046 0.044 0.042 0.04
D 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 ... 0.048 0.046 0.044 0.042 0.04 0.038 0.036 0.034 0.032 0.03
E 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ... 0.022 0.024 0.026 0.028 0.03 0.032 0.034 0.036 0.038 0.04
F 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ... 0.322 0.344 0.366 0.388 0.41 0.436 0.462 0.488 0.514 0.54
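A self-contained sketch of the same reindex-and-interpolate idea on a one-row frame (the values are made up for illustration):

```python
import pandas as pd

# One row of percentages observed at columns 0, 5, 10.
df = pd.DataFrame([[0.5, 0.4, 0.2]], index=["A"], columns=[0, 5, 10])

# Insert the missing integer columns as NaN, then fill them linearly.
r = range(df.columns.min(), df.columns.max() + 1)
out = df.reindex(columns=r).interpolate(axis=1)
```

Because every row is interpolated with the same linear rule over the same column positions, the columns stay consistent with each other, which is what row-by-row interp1d calls can fail to guarantee.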

Row based chart plot (Seaborn or Matplotlib)

Given that my data is a pandas dataframe and looks like this:
Ref +1 +2 +3 +4 +5 +6 +7
2013-05-28 1 -0.44 0.03 0.06 -0.31 0.13 0.56 0.81
2013-07-05 2 0.84 1.03 0.96 0.90 1.09 0.59 1.15
2013-08-21 3 0.09 0.25 0.06 0.09 -0.09 -0.16 0.56
2014-10-15 4 0.35 1.16 1.91 3.44 2.75 1.97 2.16
2015-02-09 5 0.09 -0.10 -0.38 -0.69 -0.25 -0.85 -0.47
How can I plot a chart of the 5 lines (1 for each Ref), where the X axis is the columns (+1, +2, ...) and each line starts from 0? If it is in seaborn, even better, but matplotlib solutions are also welcome.
Plotting a dataframe in pandas is generally all about reshaping the table so that the individual lines you want are in separate columns, and the x-values are in the index. Some of these reshape operations are a bit ugly, but you can do:
df = pd.read_clipboard()
plot_table = pd.melt(df.reset_index(), id_vars=['index', 'Ref'])
plot_table = plot_table.pivot(index='variable', columns='Ref', values='value')
# Add extra row to have all lines start from 0:
plot_table.loc['+0', :] = 0
plot_table = plot_table.sort_index()
plot_table
Ref 1 2 3 4 5
variable
+0 0.00 0.00 0.00 0.00 0.00
+1 -0.44 0.84 0.09 0.35 0.09
+2 0.03 1.03 0.25 1.16 -0.10
+3 0.06 0.96 0.06 1.91 -0.38
+4 -0.31 0.90 0.09 3.44 -0.69
+5 0.13 1.09 -0.09 2.75 -0.25
+6 0.56 0.59 -0.16 1.97 -0.85
+7 0.81 1.15 0.56 2.16 -0.47
Now that you have a table with the right shape, plotting is pretty automatic:
plot_table.plot()
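As an alternative sketch of the same reshape, set_index plus a transpose avoids the melt/pivot pair entirely (a trimmed-down version of the question's frame for illustration):

```python
import pandas as pd

# Trimmed-down frame: date index, a Ref label, and the +1/+2 value columns.
df = pd.DataFrame(
    {"Ref": [1, 2, 3],
     "+1": [-0.44, 0.84, 0.09],
     "+2": [0.03, 1.03, 0.25]},
    index=["2013-05-28", "2013-07-05", "2013-08-21"],
)

plot_table = df.set_index("Ref").T  # transpose: one column (line) per Ref
plot_table.loc["+0"] = 0            # extra row so every line starts from 0
plot_table = plot_table.sort_index()
# plot_table.plot() would now draw one line per Ref, starting at 0
```

The transpose puts the +N labels in the index and one Ref per column, which is exactly the shape DataFrame.plot expects.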
