Suppose I have a df as below. How can I add a sum() row to the DataFrame?
df.columns=['value_a','value_b','name','up_or_down','difference']
df
value_a value_b name up_or_down difference
project_name
# sum 27.56 25.04 sum down -1.31
2021-project11 0.43 0.48 2021-project11 up 0.05
2021-project1 0.62 0.56 2021-project1 down -0.06
2021-project2 0.51 0.47 2021-project2 down -0.04
2021-porject3 0.37 0.34 2021-porject3 down -0.03
2021-porject4 0.64 0.61 2021-porject4 down -0.03
2021-project5 0.32 0.25 2021-project5 down -0.07
2021-project6 0.75 0.81 2021-project6 up 0.06
2021-project7 0.60 0.60 2021-project7 down 0.00
2021-project8 0.85 0.74 2021-project8 down -0.11
2021-project10 0.67 0.67 2021-project10 down 0.00
2021-project9 0.73 0.73 2021-project9 down 0.00
2021-project11 0.54 0.54 2021-project11 down 0.00
2021-project12 0.40 0.40 2021-project12 down 0.00
2021-project13 0.76 0.77 2021-project13 up 0.01
2021-project14 1.16 1.28 2021-project14 up 0.12
2021-project15 1.01 0.94 2021-project15 down -0.07
2021-project16 1.23 1.24 2021-project16 up 0.01
2022-project17 0.40 0.36 2022-project17 down -0.04
2022-project_11 0.40 0.40 2022-project_11 down 0.00
2022-project4 1.01 0.80 2022-project4 down -0.21
2022-project1 0.65 0.67 2022-project1 up 0.02
2022-project2 0.75 0.57 2022-project2 down -0.18
2022-porject3 0.32 0.32 2022-porject3 down 0.00
2022-project18 0.91 0.56 2022-project18 down -0.35
2022-project5 0.84 0.89 2022-project5 up 0.05
2022-project19 0.61 0.48 2022-project19 down -0.13
2022-project6 0.77 0.80 2022-project6 up 0.03
2022-project20 0.63 0.54 2022-project20 down -0.09
2022-project8 0.59 0.55 2022-project8 down -0.04
2022-project21 0.58 0.54 2022-project21 down -0.04
2022-project10 0.76 0.76 2022-project10 down 0.00
2022-project9 0.70 0.71 2022-project9 up 0.01
2022-project22 0.62 0.56 2022-project22 down -0.06
2022-project23 2.03 1.74 2022-project23 down -0.29
2022-project12 0.39 0.39 2022-project12 down 0.00
2022-project24 1.35 1.55 2022-project24 up 0.20
project25 0.45 0.42 project25 down -0.03
project26 0.53 NaN project26 down NaN
project27 0.68 NaN project27 down NaN
I tried
df.sum().columns=['value_a_sun','value_b_sum','difference_sum']
I would like to add the sum row below to the DataFrame above:
sum 27.56 25.04 sum down -1.31
But I got AttributeError: 'Series' object has no attribute 'column'. How can I fix this? Thanks so much for any advice.
Filter the column names in a subset with [] before calling sum, and assign the result to a new row with DataFrame.loc:
df.loc['sum'] = df[['value_a','value_b','difference']].sum()
To add the sum as the first row instead:
df1 = df[['value_a','value_b','difference']].sum().to_frame().T
df = pd.concat([df1, df], ignore_index=True)
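If you want to keep the project_name index and give the prepended row a 'sum' label, rather than resetting everything with ignore_index=True, a small variation of the same idea (a sketch):
import pandas as pd

# one-row frame holding the column sums, labelled 'sum'
# (the non-numeric columns 'name' and 'up_or_down' will be NaN in this row)
df1 = df[['value_a', 'value_b', 'difference']].sum().to_frame().T
df1.index = ['sum']

# prepend without resetting the index, so the project names are preserved
df = pd.concat([df1, df])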
I am trying to concat two pivot tables, but after joining them the column names are lost.
Pivot1:
SATISFIED_CHECKOUT 1.0 2.0 3.0 4.0 5.0
SEGMENT
BOTH_TX_SPEND_GROWN 0.01 0.03 0.04 0.14 0.80
BOTH_TX_SPEND_NO_GROWTH 0.01 0.03 0.04 0.14 0.78
ONLY_SHOPPED_2018 NaN 0.03 0.04 0.15 0.78
ONLY_SHOPPED_2019 0.01 0.02 0.05 0.13 0.78
ONLY_SPEND_GROWN 0.01 0.02 0.03 0.12 0.82
ONLY_TX_GROWN 0.01 0.03 0.03 0.14 0.79
SHOPPED_NEITHER NaN 0.04 0.02 0.15 0.79
Pivot2:
SATISFIED_FOOD 1.0 2.0 3.0 4.0 5.0
SEGMENT
BOTH_TX_SPEND_GROWN 0.00 0.01 0.07 0.20 0.71
BOTH_TX_SPEND_NO_GROWTH 0.00 0.01 0.08 0.19 0.71
ONLY_SHOPPED_2018 0.01 0.01 0.07 0.19 0.71
ONLY_SHOPPED_2019 0.00 0.01 0.10 0.19 0.69
ONLY_SPEND_GROWN 0.00 0.01 0.08 0.18 0.72
ONLY_TX_GROWN 0.00 0.02 0.07 0.19 0.72
SHOPPED_NEITHER NaN NaN 0.10 0.20 0.70
The original df looks like this:
SATISFIED_CHECKOUT SATISFIED_FOOD Persona
1 1 BOTH_TX_SPEND_GROWN
2 3 BOTH_TX_SPEND_NO_GROWTH
3 2 ONLY_SHOPPED_2019
.... .... ............
5 3 ONLY_SHOPPED_2019
I am using the code:
a = pd.pivot_table(df,index=["SEGMENT"], columns=["SATISFIED_FOOD"], aggfunc='size').apply(lambda x: x / x.sum(), axis=1).round(2)
b = pd.pivot_table(df,index=["SEGMENT"], columns=["SATISFIED_CHECKOUT"], aggfunc='size').apply(lambda x: x / x.sum(), axis=1).round(2)
pd.concat([a, b],axis=1)
The result looks like this:
1.0 2.0 3.0 4.0 ... 2.0 3.0 4.0 5.0
SEGMENT ...
BOTH_TX_SPEND_GROWN 0.01 0.03 0.07 0.23 ... 0.03 0.04 0.14 0.80
BOTH_TX_SPEND_NO_GROWTH 0.01 0.03 0.06 0.22 ... 0.03 0.04 0.14 0.78
ONLY_SHOPPED_2018 0.01 0.04 0.08 0.24 ... 0.03 0.04 0.15 0.78
ONLY_SHOPPED_2019 0.01 0.03 0.08 0.25 ... 0.02 0.05 0.13 0.78
ONLY_SPEND_GROWN 0.00 0.03 0.07 0.22 ... 0.02 0.03 0.12 0.82
ONLY_TX_GROWN 0.01 0.02 0.05 0.22 ... 0.03 0.03 0.14 0.79
SHOPPED_NEITHER NaN 0.01 0.07 0.28 ... 0.04 0.02 0.15 0.79
[7 rows x 15 columns]
But what I want to see is a result like this:
SATISFIED_CHECKOUT SATISFIED_FOOD
1.0 2.0 3.0 4.0 ... 2.0 3.0 4.0 5.0
SEGMENT ...
BOTH_TX_SPEND_GROWN 0.01 0.03 0.07 0.23 ... 0.03 0.04 0.14 0.80
BOTH_TX_SPEND_NO_GROWTH 0.01 0.03 0.06 0.22 ... 0.03 0.04 0.14 0.78
ONLY_SHOPPED_2018 0.01 0.04 0.08 0.24 ... 0.03 0.04 0.15 0.78
ONLY_SHOPPED_2019 0.01 0.03 0.08 0.25 ... 0.02 0.05 0.13 0.78
ONLY_SPEND_GROWN 0.00 0.03 0.07 0.22 ... 0.02 0.03 0.12 0.82
ONLY_TX_GROWN 0.01 0.02 0.05 0.22 ... 0.03 0.03 0.14 0.79
SHOPPED_NEITHER NaN 0.01 0.07 0.28 ... 0.04 0.02 0.15 0.79
[7 rows x 15 columns]
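One way to get that layout is to pass keys to pd.concat, which builds a two-level column index from the labels you supply (a sketch based on the pivot code above; note that b holds the SATISFIED_CHECKOUT pivot and a holds the SATISFIED_FOOD pivot):
import pandas as pd

result = pd.concat([b, a], axis=1,
                   keys=['SATISFIED_CHECKOUT', 'SATISFIED_FOOD'])
The rating values (1.0 through 5.0) then sit on the second level, under each survey question.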
I have a dataframe of percentages, and I want to interpolate the intermediate values
0 5 10 15 20 25 30 35 40
A 0.50 0.50 0.50 0.49 0.47 0.41 0.35 0.29 0.22
B 0.31 0.31 0.31 0.29 0.28 0.24 0.22 0.18 0.13
C 0.09 0.09 0.09 0.09 0.08 0.07 0.06 0.05 0.04
D 0.08 0.08 0.08 0.08 0.06 0.06 0.05 0.04 0.03
E 0.01 0.01 0.01 0.01 0.01 0.02 0.02 0.03 0.04
F 0.01 0.01 0.01 0.04 0.10 0.20 0.30 0.41 0.54
So far, I've been using scipy's interp1d and iterating row by row, but it doesn't always maintain the ratios perfectly down the column. Is there a way to do this all together in one function?
Use reindex, then interpolate:
r = range(df.columns.min(), df.columns.max() + 1)
df.reindex(columns=r).interpolate(axis=1)
0 1 2 3 4 5 6 7 8 9 ... 31 32 33 34 35 36 37 38 39 40
A 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 ... 0.338 0.326 0.314 0.302 0.29 0.276 0.262 0.248 0.234 0.22
B 0.31 0.31 0.31 0.31 0.31 0.31 0.31 0.31 0.31 0.31 ... 0.212 0.204 0.196 0.188 0.18 0.170 0.160 0.150 0.140 0.13
C 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 ... 0.058 0.056 0.054 0.052 0.05 0.048 0.046 0.044 0.042 0.04
D 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 ... 0.048 0.046 0.044 0.042 0.04 0.038 0.036 0.034 0.032 0.03
E 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ... 0.022 0.024 0.026 0.028 0.03 0.032 0.034 0.036 0.038 0.04
F 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ... 0.322 0.344 0.366 0.388 0.41 0.436 0.462 0.488 0.514 0.54
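Putting it together as a runnable sketch, assuming df is the frame shown in the question (if the column labels were read in as strings, convert them to integers first):
# make sure the column labels are integers so range() over them works
df.columns = df.columns.astype(int)

# add every intermediate integer column, then fill the gaps linearly along each row
r = range(df.columns.min(), df.columns.max() + 1)
df_full = df.reindex(columns=r).interpolate(axis=1)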
I have a data file containing 7500 lines like this:
Y1C 1.53 -0.06 0.58 0.52 0.42 0.16 0.79 -0.6 -0.3
-0.78 -0.14 0.38 0.34 0.23 0.26 -1.8 -0.1 -0.17 0.3
0.6 0.9 0.71 0.5 0.49 1.06 0.25 0.96 -0.39 0.24 0.69
0.41 0.7 -0.16 -0.39 0.6 1.04 0.4 -0.04 0.36 0.23 -0.14
-0.09 0.15 -0.46 -0.05 0.32 -0.54 -0.28 -0.15 1.34 0.29
0.59 -0.43 -0.55 -0.18 -0.01 0.68 -0.06 -0.11 -0.67
-0.25 -0.34 -0.38 0.02 -0.21 0.12 0.01 0.07 0.15 0.14
0.15 -0.11 0.07 -0.41 -0.2 0.24 0.06 0.12 0.12 0.11
0.1 0.24 -0.71 0.22 -0.02 0.15 0.84 1.39 0.13 0.48
0.19 -0.23 -0.12 0.33 0.37 0.18 0.06 0.32 0.09
-0.09 0.02 -0.01 -0.06 -0.23 0.52 0.14 0.24 -0.05 0.37
0.1 0.45 0.38 1.34 0.74 0.5 0.92 0.91 1.34 1.78 2.26
0.05 0.29 0.53 0.17 0.41 0.47 0.47 1.21 0.87 0.68
1.08 0.89 0.13 0.5 0.57 -0.5 -0.78 -0.34 -0.3 0.54
0.31 0.64 1.23 0.335 0.36 -0.65 0.39 0.39 0.31 0.73
0.54 0.3 0.26 0.47 0.13 0.24 -0.6 0.02 0.11 0.27
0.21 -0.3 -1 -0.44 -0.15 -0.51 0.3 0.14 -0.15 -0.27 -0.27
Y2W -0.01 -0.3 0.23 0.01 -0.15 0.45 -0.04 0.14 -1.16
-0.14 -0.56 -0.13 0.77 0.77 -0.57 0.48 0.22 -0.08
-1.81 -0.46 -0.17 0.2 -0.18 -0.45 -0.4 1.35 0.81 1.21
0.52 0.02 -0.06 0.37 0 -0.38 -0.02 0.48 0 0.58 0.81
0.54 0.18 -0.11 0.03 0.1 -0.38 0.17 0.37 -0.05 0.13
-0.01 -0.17 0.36 0.22 0 -1.4 -0.67 -0.45 -0.62 -0.58
-0.47 -0.86 -1.12 -0.43 0.1 0.06 -0.45 -0.14 0.68 -0.16
0.14 0.14 0.18 0.14 0.17 0.13 0.07 0.05 0.04 0.07
-0.01 0.03 0.05 0.02 0.12 0.34 -0.04 -0.75 1.68 0.23
0.49 0.38 -0.57 0.17 -0.04 0.19 0.22 0.29 -0.04 -0.3
0.18 0.04 0.3 -0.06 -0.07 -0.21 -0.01 0.51 -0.04 -0.04
-0.23 0.06 0.9 -0.14 0.19 2.5 2.84 3.27 2.13 2.5 2.66
4.16 3.52 -0.12 0.13 0.44 0.32 0.44 0.46 0.7 0.68
0.99 0.83 0.74 0.51 0.33 0.22 0.01 0.33 -0.19 0.4
0.41 0.07 0.18 -0.01 0.45 -0.37 -0.49 1.02 -0.59
-1.44 -1.53 -0.74 -1.48 0.12 0.05 0.02 -0.1 0.57
-0.36 0.1 -0.16 -0.23 -0.34 -0.61 -0.37 -0.14 -0.22 -0.27
-0.08 -0.08 -0.17 0.18 -0.74
Y3W 0.15 -0.07 -0.25 -0.3 -1.12 -0.67 -0.15 -0.43 0.63
0.92 0.25 0.33 0.81 -0.12 -0.12 0.67 0.86 0.86
1.54 -0.3 0 -0.29 -0.74 0.15 0.59 0.15 0.34 0.23
0.5 0.52 0.25 0.86 0.53 0.51 0.25 -1.29 -1.79
-0.45 -0.64 0.01 -0.58 -0.51 -0.74 -1.32 -0.47
-0.81 0.55 -0.09 0.46 -0.3 -0.2 -0.81 -1.56 -2.74 1.03
1 1.01 0.29 -0.64 -1.03 0.07 0.46 0.33 0.04 -0.6
-0.64 -0.51 -0.36 -0.1 0.13 -1.4 -1.17 -0.64 -0.16 -0.5
-0.47 0.75 0.62 0.7 1.06 0.93 0.56 -2.25 -0.94 -0.09
0.08 -0.15 -1.6 -1.43 -0.84 -0.25 -1.22 -0.92 -1.22
-0.97 -0.84 -0.89 0.24 0 -0.04 -0.64 -0.94 -1.56 -2.32
0.63 -0.17 -3.06 -2.4 -2 -1.4 -0.81 -1.6 -3.06 -1.79
0.17 0.28 -0.67 -2.82 -1.47 -1.82 -1.69 -1.38 -1.96
-1.88 -2.34 -3.06 -0.18 0.5 -0.03 -0.49 -0.61 -0.54 -0.37
0.1 -0.92 -1.79 -0.03 -0.54 0.94 -1 0.15 0.95 0.55
-0.36 0.4 -0.73 0.85 -0.26 0.55 0.14 -0.36 0.38 0.87
0.62 0.66 0.79 -0.67 0.48 0.62 0.48 0.72 0.73 0.29
-0.3 -0.81
Y4W 0.24 0.76 0.2 0.34 0.11 0.07 0.01 0.36 0.4 -0.25
-0.45 0.09 -0.97 0.19 0.28 -1.81 -0.64 -0.49 -1.27
-0.95 -0.1 0.12 -0.1 0 -0.08 0.77 1.02 0.92 0.56
0.1 0.7 0.57 0.16 1.29 0.82 0.55 0.5 1.83 1.79 0.01
0.24 -0.67 -0.85 -0.42 -0.37 0.2 0.07 -0.01 -0.17 -0.2
-0.43 -0.34 0.12 -0.21 -0.23 -0.22 -0.1 -0.07 -0.61
-0.14 -0.43 -0.97 0.27 0.7 0.54 0.11 -0.5 -0.39 0.01
0.61 0.88 1 0.35 0.67 0.6 0.78 0.46 0.09 -0.06
-0.35 0.08 -0.14 -0.32 -0.11 0 0.01 0.02 0.77 0.18
0.36 -1.15 -0.42 -0.19 0.06 -0.25 -0.81 -0.63 -1.43
-0.85 -0.88 -0.68 -0.59 -1.01 -0.68 -0.71 0.15 0.08 0.08
-0.03 -0.2 0.03 -0.18 -0.01 -0.08 -1.61 -0.67 -0.74
-0.54 -0.8 -1.02 -0.84 -1.91 -0.22 -0.02 0.05 -0.84
-0.65 -0.82 -0.4 -0.83 -0.9 -1.04 -1.23 -0.91 0.28 0.68
0.57 -0.02 0.4 -1.52 0.17 0.44 -1.18 0.04 0.17 0.16
0.04 -0.26 0.04 0.1 -0.11 -0.64 -0.09 -0.16 0.16 -0.05
0.39 0.39 -0.06 0.46 0.2 -0.45 -0.35 -1.3 -0.26 -0.29
0.02 0.16 0.18 -0.35 -0.45 -1.04 -0.69
Y5C 2.85 3.34 -1 -0.47 -0.66 -0.03 1.41
0.8 0 0.41 -0.14 -0.86 -0.79 -1.69 0 0 1.52
1.29 0.84 0.58 1.02 1.35 0.45 1.02 1.47 0.82 0.46
0.25 0.77 0.93 -0.58 -0.67 -0.18 -0.56 -0.01 0.25
-0.71 -0.49 -0.43 0 -1.06 0.44 -0.29 0.26 -0.04
-0.14 -0.1 -0.12 -1.6 0.33 0.62 0.52 0.7 -0.22 0.44
-0.6 0.86 1.19 1.58 0.93 1 0.85 1.24 1.06 0.49
0.26 0.18 0.3 -0.09 -0.42 0.05 0.54 0.24 0.37 0.86
0.9 0.49 -1.47 -0.2 -0.43 0.2 0.1 -0.81 -0.74 -1.36 -0.97
-0.94 -0.86 -1.56 -1.89 -1.89 -1.06 0.12 0.06 0.04
-0.01 -0.12 0.01 -0.15 0.76 0.89 0.71 -1.12 0.03
-0.86 0.26 0 -0.25 -0.06 0.19 0.41 0.58 -0.46 0.01
-0.15 0.04 -1.01 -0.57 -0.71 -0.3 -1.01 1.83 0.59
1.04 -1.43 0.38 0.65 -6.64 -0.42 0.24 0.46 0.96 0.24
0.7 1.21 0.6 0.12 0.77 -0.03 0.53 0.31 0.46 0.51
-0.45 0.23 0.32 -0.34 -0.1 0.1 -0.45 0.74 -0.06 0.21
0.29 0.45 0.68 0.29 0.45
Y7C -0.22 -0.12 -0.29 -0.51 -0.81 -0.47 0.28 -0.1 0.15
0.38 0.18 -0.27 0.12 -0.15 0.43 0.25 0.19 0.33 0.67
0.86 -0.56 -0.29 -0.36 -0.42 0.08 0.04 -0.04 0.15 0.38
-0.07 -0.1 -0.2 -0.03 -0.29 0.06 0.65 0.58 0.86 2.05
0.3 0.33 -0.29 -0.23 -0.15 -0.32 0.08 0.34 0.15 0
-0.01 0.28 0.36 0.25 0.46 0.4 0.7 0.49 0.97 1.04
0.36 -0.47 -0.29 0.77 0.57 0.45 0.77 0.24 -0.23 0.12
0.49 0.62 0.49 0.84 0.89 1.08 0.87 -0.18 -0.43
-0.39 -0.18 -0.02 0.01 0.2 -0.2 -0.03 0.01 0.25 0.1
-0.07 -1.43 -0.2 -0.4 0.32 0.72 -0.42 -0.3 -0.38
-0.22 -0.81 -1.15 -1.6 -1.89 -2.06 -2.4 0.08 0.34 0.1
-0.15 -0.06 -0.17 -0.47 -0.4 0.15 -1.22 -1.43 -1.03
-1.03 -1.64 -1.84 -2.64 -2 0.05 0.4 0.88 -1.54 -1.21
-1.46 -1.92 -1.52 -1.92 -1.7 -1.94 -1.86 -0.1 -0.02
-0.22 -0.34 -0.48 0.28 0 0.14 0.4 -0.29 -0.27 -0.3
-0.67 -0.09 0.23 0.33 0.23 0.1 0.38 -0.51 0.23 -0.73
0.22 -0.47 0.24 0.68 0.53 0.23 -0.1 0.11 -0.18 0.16
0.68 0.55 0.28 -0.03 0.03 0.08 0.12
There are missing values. I wanted to load it as a matrix, so I used:
data = np.genfromtxt("This_data.txt", delimiter='\t', missing_values=np.nan)
When I run it I get:
Traceback (most recent call last):
File "matrix.py", line 8, in <module>
data = np.genfromtxt("This_data.txt", delimiter='\t', missing_values=np.nan ,usecols=np.arange(0,174))
File "/home/anaconda2/lib/python2.7/numpy/lib/npyio.py", line 1769, in genfromtxt
raise ValueError(errmsg)
ValueError: Some errors were detected !
Line #25 (got 172 columns instead of 174)
I then tried:
data = np.genfromtxt("This_data.txt", delimiter='\t', missing_values=np.nan ,usecols=np.arange(0,174))
But I get the same error. Any suggestions?
A short sample bytestring substitute for a file:
In [168]: txt = b"""Y7C\t-0.22\t-0.12\t-0.29\t-0.51\t-0.81
...: Y7C\t-0.22\t-0.12\t-0.29\t-0.51\t-0.81
...: Y7C\t-0.22\t-0.12\t-0.29\t-0.51\t-0.81
...: """
A minimal load with the correct delimiter. Note the first column is nan, because the strings can't be converted to float.
In [169]: np.genfromtxt(txt.splitlines(),delimiter='\t')
Out[169]:
array([[ nan, -0.22, -0.12, -0.29, -0.51, -0.81],
[ nan, -0.22, -0.12, -0.29, -0.51, -0.81],
[ nan, -0.22, -0.12, -0.29, -0.51, -0.81]])
With dtype=None it sets each column's dtype automatically, creating a structured array:
In [170]: np.genfromtxt(txt.splitlines(),delimiter='\t',dtype=None)
Out[170]:
array([(b'Y7C', -0.22, -0.12, -0.29, -0.51, -0.81),
(b'Y7C', -0.22, -0.12, -0.29, -0.51, -0.81),
(b'Y7C', -0.22, -0.12, -0.29, -0.51, -0.81)],
dtype=[('f0', 'S3'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8'), ('f5', '<f8')])
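If you do go the structured-array route, you can still pull a plain float matrix out of the numeric fields, using the auto-assigned names shown in the dtype above (a sketch, reusing the txt sample):
arr = np.genfromtxt(txt.splitlines(), delimiter='\t', dtype=None)
# stack the float fields f1..f5 column-wise, dropping the label field f0
mat = np.column_stack([arr[name] for name in arr.dtype.names[1:]])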
Spell out the columns to use, skipping the first:
In [172]: np.genfromtxt(txt.splitlines(),delimiter='\t',usecols=np.arange(1,6))
Out[172]:
array([[-0.22, -0.12, -0.29, -0.51, -0.81],
[-0.22, -0.12, -0.29, -0.51, -0.81],
[-0.22, -0.12, -0.29, -0.51, -0.81]])
But if I ask for more columns than it finds, I get an error like yours:
In [173]: np.genfromtxt(txt.splitlines(),delimiter='\t',usecols=np.arange(1,7))
---------------------------------------------------------------------------
....
ValueError: Some errors were detected !
Line #1 (got 6 columns instead of 6)
Line #2 (got 6 columns instead of 6)
Line #3 (got 6 columns instead of 6)
Your missing_values parameter doesn't help; that's the wrong use for it. The correct use of missing_values is to detect a string value and replace it with a valid float value:
In [177]: np.genfromtxt(txt.splitlines(), delimiter='\t', missing_values='Y7C', filling_values=0)
Out[177]:
array([[ 0. , -0.22, -0.12, -0.29, -0.51, -0.81],
[ 0. , -0.22, -0.12, -0.29, -0.51, -0.81],
[ 0. , -0.22, -0.12, -0.29, -0.51, -0.81]])
If the file has enough delimiters, it can treat the empty fields as missing values:
In [178]: txt = b"""Y7C\t-0.22\t-0.12\t-0.29\t-0.51\t-0.81\t\t
...: Y7C\t-0.22\t-0.12\t-0.29\t-0.51\t-0.81\t\t
...: Y7C\t-0.22\t-0.12\t-0.29\t-0.51\t-0.81\t\t
...: """
In [179]: np.genfromtxt(txt.splitlines(),delimiter='\t')
Out[179]:
array([[ nan, -0.22, -0.12, -0.29, -0.51, -0.81, nan, nan],
[ nan, -0.22, -0.12, -0.29, -0.51, -0.81, nan, nan],
[ nan, -0.22, -0.12, -0.29, -0.51, -0.81, nan, nan]])
In [180]: np.genfromtxt(txt.splitlines(),delimiter='\t',filling_values=0)
Out[180]:
array([[ 0. , -0.22, -0.12, -0.29, -0.51, -0.81, 0. , 0. ],
[ 0. , -0.22, -0.12, -0.29, -0.51, -0.81, 0. , 0. ],
[ 0. , -0.22, -0.12, -0.29, -0.51, -0.81, 0. , 0. ]])
I believe the pandas csv reader can handle 'ragged' columns and missing values better.
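For example, something along these lines (a sketch; it assumes tab-delimited rows with at most 174 fields per line and a text label in the first field):
import pandas as pd

# names=list(range(174)) pads short rows with NaN instead of raising an error
df = pd.read_csv("This_data.txt", sep="\t", header=None, names=list(range(174)))

labels = df.iloc[:, 0]                         # Y1C, Y2W, ... row labels
matrix = df.iloc[:, 1:].to_numpy(dtype=float)  # numeric part as a float array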
Evidently the program does not like the fact that you have missing values, probably because you're generating a matrix, so it will not replace the missing values with NaNs on its own. Try adding 0's in the places with missing values, or at least the tab delimiters, so that each line registers as having all 174 columns.
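A sketch of that padding idea, assuming tab-delimited rows with 174 expected fields per line (i.e. 173 tabs):
import numpy as np

with open("This_data.txt") as f:
    lines = [line.rstrip("\n") for line in f]

# pad every line with tabs until it has 173 of them, i.e. 174 fields
padded = [line + "\t" * (173 - line.count("\t")) for line in lines]

# empty fields now parse as nan; skip the text label in column 0
data = np.genfromtxt(padded, delimiter="\t", usecols=np.arange(1, 174))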
I would like to fill an empty dataframe by appending several files of the same type and structure. However, I can't see what's wrong here:
def files2df(colnames, ext):
df = DataFrame(columns = colnames)
for inf in sorted(glob.glob(ext)):
dfin = read_csv(inf, sep='\t', skiprows=1)
print(dfin.head(), '\n')
df.append(dfin, ignore_index=True)
return df
The resulting dataframe is empty. Could someone give me a hand?
1.0 16.59 0.597 0.87 1.0.1 3282 100.08
0 0.953 14.52 0.561 0.80 0.99 4355 -
1 1.000 31.59 1.000 0.94 1.00 6322 -
2 1.000 6.09 0.237 0.71 1.00 10568 -
3 1.000 31.29 1.000 0.94 1.00 14363 -
4 1.000 31.59 1.000 0.94 1.00 19797 -
1.0 6.69 0.199 0.74 1.0.1 186 13.16
0 1 0.88 0.020 0.13 0.99 394 -
1 1 0.75 0.017 0.11 0.99 1052 -
2 1 3.34 0.097 0.57 1.00 1178 -
3 1 1.50 0.035 0.26 1.00 1211 -
4 1 20.59 0.940 0.88 1.00 1583 -
1.0 0.12 0.0030 0.04 0.97 2285 2.62
0 1 1.25 0.135 0.18 0.99 2480 -
1 1 0.03 0.001 0.04 0.97 7440 -
2 1 0.12 0.003 0.04 0.97 8199 -
3 1 1.10 0.092 0.16 0.99 11174 -
4 1 0.27 0.007 0.06 0.98 11310 -
0.244 0.07 0.0030 0.02 0.76 41314 1.32
0 0.181 0.64 0.028 0.03 0.36 41755 -
1 0.161 0.18 0.008 0.01 0.45 42420 -
2 0.161 0.18 0.008 0.01 0.45 42461 -
3 0.237 0.25 0.011 0.02 0.56 43060 -
4 0.267 1.03 0.047 0.07 0.46 43321 -
0.163 0.12 0.0060 0.01 0.5 103384 1.27
0 0.243 0.27 0.014 0.02 0.56 104693 -
1 0.215 0.66 0.029 0.04 0.41 105192 -
2 0.190 0.10 0.005 0.01 0.59 105758 -
3 0.161 0.12 0.006 0.01 0.50 109783 -
4 0.144 0.16 0.007 0.01 0.42 110067 -
Empty DataFrame
Columns: array([D, LOD, r2, CIlow, CIhi, Dist, T-int], dtype=object)
Index: array([], dtype=object)
df.append(dfin, ignore_index=True) returns a new DataFrame; it does not change df in place.
Use df = df.append(dfin, ignore_index=True). Even with that change, though, appending inside a loop is slow because every call copies the whole frame, and DataFrame.append has since been deprecated (and removed in pandas 2.0).
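For reference, the original function with just that one line changed (a sketch; it relies on the old DataFrame.append API, so it only runs on pandas versions before 2.0):
import glob
from pandas import DataFrame, read_csv

def files2df(colnames, ext):
    df = DataFrame(columns=colnames)
    for inf in sorted(glob.glob(ext)):
        dfin = read_csv(inf, sep='\t', skiprows=1)
        df = df.append(dfin, ignore_index=True)  # reassign: append returns a new frame
    return df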
In this scenario (reading multiple files and using all of the data to create a single DataFrame), I would use pandas.concat(). The code below gives you a frame whose columns are named by colnames and whose rows come from the data in the csv files.
def files2df(colnames, ext):
files = sorted(glob.glob(ext))
frames = [read_csv(inf, sep='\t', skiprows=1, names=colnames) for inf in files]
return concat(frames, ignore_index=True)
I did not try this code, I just wrote it here; you may need to tweak it to get it running, but the idea should be clear, I hope.
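A hypothetical call, using the column names visible in the printout above and a placeholder glob pattern:
colnames = ['D', 'LOD', 'r2', 'CIlow', 'CIhi', 'Dist', 'T-int']
df = files2df(colnames, '*.txt')   # '*.txt' stands in for whatever file pattern you use
print(df.shape)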
Also, I found another solution, but I don't know which one is faster.
def files2df(colnames, ext):
dflist = [ ]
for inf in sorted(glob.glob(ext)):
dflist.append(read_csv(inf, names = colnames, sep='\t', skiprows=1))
#print(dflist)
df = concat(dflist, axis = 0, ignore_index=True)
#print(df.to_string())
return df