Can pandas combine multiple lists of readings and return the maximum reading values for the elements in aoiFeatures?
Given:
# FYI: 2.4 million elements in each of these lists in reality
allFeatures = [101, 179, 181, 183, 185, 843, 845, 847, 849, 851]
allReadings1 = [0.27, 0.25, 0.13, 0.04, 0.05, 0.09, 0.15, 0.13, 0.12, 0.20]
allReadings2 = [0.25, 0.06, 0.29, 0.29, 0.04, 0.21, 0.07, 0.06, 0.07, 0.06]
allReadings3 = [0.12, 0.02, 0.20, 0.27, 0.04, 0.08, 0.11, 0.24, 0.00, 0.13]
allReadings4 = [0.21, 0.00, 0.22, 0.11, 0.24, 0.16, 0.11, 0.18, 0.27, 0.14]
allReadings5 = [0.02, 0.18, 0.26, 0.22, 0.23, 0.15, 0.24, 0.28, 0.00, 0.07]
allReadings6 = [0.08, 0.25, 0.21, 0.23, 0.14, 0.21, 0.18, 0.09, 0.17, 0.27]
allReadings7 = [0.20, 0.02, 0.28, 0.16, 0.18, 0.27, 0.29, 0.19, 0.29, 0.13]
allReadings8 = [0.17, 0.01, 0.07, 0.23, 0.14, 0.20, 0.19, 0.01, 0.15, 0.17]
allReadings9 = [0.12, 0.18, 0.09, 0.10, 0.00, 0.03, 0.11, 0.03, 0.14, 0.14]
allReadings10 = [0.13, 0.03, 0.20, 0.13, 0.30, 0.30, 0.28, 0.12, 0.19, 0.22]
# FYI: 67,000 elements in this list in reality
aoiFeatures = [181, 843, 849]
Result:
181 0.29
843 0.30
849 0.29
First zip all the lists together, pass them to the DataFrame constructor with the index parameter, then select rows by loc and take the row-wise max:
L = list(zip(allReadings1,
             allReadings2,
             allReadings3,
             allReadings4,
             allReadings5,
             allReadings6,
             allReadings7,
             allReadings8,
             allReadings9,
             allReadings10))
df = pd.DataFrame(L, index=allFeatures)
print (df)
0 1 2 3 4 5 6 7 8 9
101 0.27 0.25 0.12 0.21 0.02 0.08 0.20 0.17 0.12 0.13
179 0.25 0.06 0.02 0.00 0.18 0.25 0.02 0.01 0.18 0.03
181 0.13 0.29 0.20 0.22 0.26 0.21 0.28 0.07 0.09 0.20
183 0.04 0.29 0.27 0.11 0.22 0.23 0.16 0.23 0.10 0.13
185 0.05 0.04 0.04 0.24 0.23 0.14 0.18 0.14 0.00 0.30
843 0.09 0.21 0.08 0.16 0.15 0.21 0.27 0.20 0.03 0.30
845 0.15 0.07 0.11 0.11 0.24 0.18 0.29 0.19 0.11 0.28
847 0.13 0.06 0.24 0.18 0.28 0.09 0.19 0.01 0.03 0.12
849 0.12 0.07 0.00 0.27 0.00 0.17 0.29 0.15 0.14 0.19
851 0.20 0.06 0.13 0.14 0.07 0.27 0.13 0.17 0.14 0.22
aoiFeatures = [181, 843, 849]
s = df.loc[aoiFeatures].max(axis=1)
print (s)
181 0.29
843 0.30
849 0.29
dtype: float64
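With 2.4 million elements per list, you can also skip the Python-level zip and build the same frame column-wise; a sketch equivalent to the construction above:

readings = [allReadings1, allReadings2, allReadings3, allReadings4, allReadings5,
            allReadings6, allReadings7, allReadings8, allReadings9, allReadings10]
# each list becomes one column (0..9), one row per feature id
df = pd.DataFrame(dict(enumerate(readings)), index=allFeatures)
s = df.loc[aoiFeatures].max(axis=1)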
Option 1
You can let Python's built-in max do the work and use a pandas.Series to hold the results:
readings = [allReadings1, allReadings2, allReadings3, allReadings4, allReadings5,
            allReadings6, allReadings7, allReadings8, allReadings9, allReadings10]
s = pd.Series(dict(zip(allFeatures, map(max, zip(*readings)))))
s[aoiFeatures]
181 0.29
843 0.30
849 0.29
dtype: float64
Option 2
Or leverage NumPy:
readings = [allReadings1, allReadings2, allReadings3, allReadings4, allReadings5,
            allReadings6, allReadings7, allReadings8, allReadings9, allReadings10]
s = pd.Series(np.max(readings, 0), allFeatures)
s[aoiFeatures]
181 0.29
843 0.30
849 0.29
dtype: float64
If you need to update the array of maximums with a new reading:
allReadings11 = [0.13, 0.03, 0.30, 0.13, 0.30, 0.30, 0.28, 0.12, 0.19, 0.22]
s[:] = np.maximum(s, allReadings11)
s[aoiFeatures]
181 0.30
843 0.30
849 0.29
dtype: float64
A very simple and quick approach:
pd.DataFrame([allReadings1, allReadings2,...],columns=allFeatures).max()
Sample output:
101    0.27
179    0.25
181    0.29
183    0.29
185    0.30
843    0.30
845    0.29
847    0.28
849    0.29
851    0.27
dtype: float64
Then select the features of interest with [aoiFeatures].
I am new to working with Python and have the following problem with my calculations.
I have a table which contains NaN values.
The NaN values always occur at night, because no solar radiation can be measured then.
I want to replace all NaN values from a night with the value 4 hours before sunset.
I already tried to use ffill, but since I don't want the last value before the NaN values, it doesn't work, unfortunately.
For example:
a=[0.88, 0.84, 0.26, 0.50, 1.17, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, 0.73, 0.81]
The successive NaN values should all have the value 0.84.
The list should therefore look like this:
a=[0.88, 0.84, 0.26, 0.50, 1.17, 0.84, 0.84, 0.84, 0.84, 0.84, 0.84, 0.84, 0.84, 0.73, 0.81]
Thanks in advance.
One option is to create a shifted and forward-filled version of the original series and then use it to fill in the nulls of the original data. Shifting by 3 puts the value from 4 positions before the first NaN at the last valid index before the gap; masking out the positions that were NaN in the original and forward-filling then carries that value through the whole gap.
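For reference, a minimal setup for the series s used below, built from the example list in the question:

import numpy as np
import pandas as pd

a = [0.88, 0.84, 0.26, 0.50, 1.17, np.nan, np.nan, np.nan, np.nan, np.nan,
     np.nan, np.nan, np.nan, 0.73, 0.81]
s = pd.Series(a)

With that in place: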
In [231]: s.fillna(s.shift(3).mask(s.isnull()).ffill())
Out[231]:
0 0.88
1 0.84
2 0.26
3 0.50
4 1.17
5 0.84
6 0.84
7 0.84
8 0.84
9 0.84
10 0.84
11 0.84
12 0.84
13 0.73
14 0.81
dtype: float64
import pandas as pd

a = [0.88, 0.84, 0.26, 0.50, 1.17, None, None, None, None, None, None, None, None, 0.73, 0.81]
df = pd.DataFrame(a)
# from row 3 on, fill NaNs with the value at row 1 (0.84, i.e. 4 rows before the NaN run)
df[3:] = df[3:].fillna(value=df.iloc[1, 0])
print(df)
0
0 0.88
1 0.84
2 0.26
3 0.50
4 1.17
5 0.84
6 0.84
7 0.84
8 0.84
9 0.84
10 0.84
11 0.84
12 0.84
13 0.73
14 0.81
import numpy as np

a = [0.88, 0.84, 0.26, 0.50, 1.17, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 0.73, 0.81]
a = np.array(a)
# replace every NaN with the value at index 1 (the reading 4 hours before sunset)
a[np.isnan(a)] = a[1]
a
Result:
array([0.88, 0.84, 0.26, 0.5 , 1.17, 0.84, 0.84, 0.84, 0.84, 0.84, 0.84,
       0.84, 0.84, 0.73, 0.81])
Note that this replaces all NaNs in the array with a[1], so it only works when the data contains a single night.
In [31]: day_1_variable.shape
Out[31]: (241, 241)
This is the dictionary of 10 numpy arrays, each 241 × 241 (rows × columns):
df_dictionary = {'arrays_to_iterate': {'day_1': day_1_variable,
                                       'day_2': day_2_variable,
                                       'day_3': day_3_variable,
                                       ...
                                       'day_10': day_10_variable}}
day = 10
for days in np.arange(1, day + 1):
    numpy_array_to_iterate = df_dictionary['arrays_to_iterate']['day_' + str(days)]
    variable_value_array = np.zeros((0), dtype='float')  ## empty array for the variable values
    for i in np.arange(numpy_array_to_iterate.shape[0]):  ## iterating over array rows
        for j in np.arange(numpy_array_to_iterate.shape[1]):  ## iterating over array columns
            variable_value_at_specific_point = numpy_array_to_iterate[i][j]
            variable_value_array = np.append(variable_value_array, variable_value_at_specific_point)  ## values filled in array

df_xyz = pd.DataFrame()
for i in np.arange(1, day + 1):
    col_name = 'variable_day_' + str(i)
    df_xyz.loc[:, col_name] = variable_value_array
df_xyz
I want to store each day's array data in a column of a pandas DataFrame, so that each column holds the variable's values for the corresponding day.
But the output I am getting has the last day's values in every column:
variable_day_1 variable_day_2 ........... variable_day_10
0 0.0625 0.0625 ........... 0.0625
1 0.0625 0.0625 ........... 0.0625
2 0.0625 0.0625 ........... 0.0625
3 0.0625 0.0625 ........... 0.0625
4 0.0625 0.0625 ........... 0.0625
... ... ... ... ... ... ... ... ... ... ...
58076 0.0000 0.0000 ........... 0.0000
58077 0.0000 0.0000 ........... 0.0000
58078 0.0000 0.0000 ........... 0.0000
58079 0.0000 0.0000 ........... 0.0000
58080 0.0000 0.0000 ........... 0.0000
58081 rows × 10 columns
How to do so?
Use NumPy stack over the dictionary values (this will give you a NumPy array with shape (10, 241, 241)), then use reshape to change the shape to (10, 58081), followed by a transpose to place the days as columns. Next, convert to a pandas DataFrame and fix the column names using the dictionary keys. (The reason your version has the last day's values in every column is that the column-assignment loop runs only after the day loop has finished, so variable_value_array holds only day 10's values by then.)
import pandas as pd
import numpy as np
#setup
np.random.seed(12345)
df_dictionary = {}
days = {f'day_{d}': np.random.rand(241,241).round(2) for d in range(1,11)}
df_dictionary['arrays_to_iterate'] = days
print(df_dictionary)
#code
all_days = np.stack(list(df_dictionary['arrays_to_iterate'].values())).reshape(10, -1).T
df = pd.DataFrame(all_days)
df.columns = df_dictionary['arrays_to_iterate'].keys()
print(df)
Output from df_dictionary:
{'arrays_to_iterate':
{'day_1':
array(
[[0.93, 0.32, 0.18, ..., 0.62, 0.89, 0.78],
[0.72, 0.31, 0.36, ..., 0.5 , 0.89, 0.38],
...,
[0.36, 0.62, 0.77, ..., 0.03, 0.57, 0.04],
[0.02, 0.07, 0.66, ..., 0.62, 0.5 , 0.04]]),
'day_2': array(
[[0.14, 0.13, 0.91, ..., 0.06, 0.72, 0.93],
[0.13, 0.02, 0.09, ..., 0.39, 0.72, 0.13],
...
Output from df
day_1 day_2 day_3 day_4 day_5 day_6 day_7 day_8 day_9 day_10
0 0.93 0.14 0.06 0.10 0.01 0.66 0.67 0.18 0.93 0.40
1 0.32 0.13 0.81 0.57 0.23 0.60 0.48 0.07 0.08 0.32
2 0.18 0.91 0.95 0.27 0.36 0.11 0.25 0.71 0.24 0.44
3 0.20 0.51 0.52 0.62 0.09 0.31 0.19 0.78 0.83 0.58
4 0.57 0.14 0.89 0.51 0.67 0.29 0.48 0.95 0.36 0.97
... ... ... ... ... ... ... ... ... ... ...
58076 0.98 0.20 0.54 0.96 0.89 0.24 0.05 0.81 0.35 0.57
58077 0.53 0.96 0.04 0.60 0.16 0.38 0.83 0.49 0.28 0.02
58078 0.62 0.50 0.74 0.67 0.43 0.30 0.91 0.68 0.15 0.43
58079 0.50 0.11 0.57 0.42 0.85 0.97 0.86 0.60 0.75 0.33
58080 0.04 0.74 0.74 0.94 0.98 0.35 0.52 0.12 0.47 0.53
[58081 rows x 10 columns]
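Alternatively, a minimal fix to the original loop structure (a sketch, assuming the dictionary keys are 'day_1' through 'day_10'): build each column inside the loop instead of after it, and flatten each 2-D array with ravel rather than appending element by element:

import numpy as np
import pandas as pd

day = 10
cols = {}
for d in np.arange(1, day + 1):
    # flatten this day's 241 x 241 array into a single column of 58081 values
    cols['variable_day_' + str(d)] = df_dictionary['arrays_to_iterate']['day_' + str(d)].ravel()
df_xyz = pd.DataFrame(cols)
print(df_xyz)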
Consider the following data frames:
base_df = pd.DataFrame({
'id': [1, 2, 3, 4, 5, 6, 7],
'type_a': ['nan', 'type3', 'type1', 'type2', 'type3', 'type5', 'type4'],
'q_a': [0, 0.9, 5.1, 3.0, 1.6, 1.1, 0.7],
'p_a': [0, 0.53, 0.71, 0.6, 0.53, 0.3, 0.33]
})
Edit: This is an extract of base_df. The original df has 100 columns and around 500 observations.
table_df = pd.DataFrame({
'type': ['type1', 'type2', 'type3', 'type3', 'type3', 'type3', 'type4', 'type4', 'type4', 'type4', 'type5', 'type5', 'type5', 'type6', 'type6'],
'q_value': [5.1, 3.1, 1.6, 1.3, 0.9, 0.85, 0.7, 0.7, 0.7, 0.5, 1.2, 1.1, 1.1, 0.4, 0.4],
'p_value': [0.71, 0.62, 0.71, 0.54, 0.53, 0.44, 0.5, 0.54, 0.33, 0.33, 0.32, 0.31, 0.28, 0.31, 0.16],
'sigma':[2.88, 2.72, 2.73, 2.79, 2.91, 2.41, 2.63, 2.44, 2.7, 2.69, 2.59, 2.67, 2.4, 2.67, 2.35]
})
Edit: The original table_df looks exactly like this one.
For every observation in base_df, I'd like to look up whether the type matches an entry in table_df. If yes:
I'd like to check whether there is an entry in table_df with the corresponding value q_a == q_value. If yes:
And there is only one such q_value, assign its sigma to base_df.
If there is more than one matching q_value, compare p_a and assign the correct sigma to base_df.
If there is no exactly matching value for q_a or p_a, just use the next bigger value; in case there is no bigger value, use the lower one, and assign the corresponding value of sigma to column sigma_a in base_df.
The resulting DF should look like this:
id type_a q_a p_a sigma_a
1 nan 0 0 0
2 type3 0.9 0.53 2.91
3 type1 5.1 0.71 2.88
4 type2 3 0.6 2.72
5 type3 1.6 0.53 2.41
6 type5 1.1 0.3 2.67
7 type4 0.7 0.33 2.7
So far I use the code below:
mapping = (pd.merge_asof(base_df.sort_values('q_a'),
table_df.sort_values('q_value'),
left_on='q_a',
left_by='type_a',
right_on='q_value',
right_by='type').set_index('id'))
base_df= (pd.merge_asof(base_df.sort_values('q_a'),
table_df.sort_values('q_value'),
left_on='q_a',
left_by='type_a',
right_on='q_value',
right_by='type',
direction = 'forward')
.set_index('id')
.combine_first(mapping)
.sort_index()
.reset_index()
)
This "two step check routine" works, but I'd like to add the third step checking p_value.
How can I realize it?
Actually, I think the metrics should not be separated into an A-segment and a B-segment;
they should be concatenated into the same column, with a Segment column as the indicator.
Anyway, according to your description,
table_df is a reference table, and the criteria are the same for _a and _b,
so I order it in a hierarchical structure with the following manipulation:
table_df.sort_values(by=["type","q_value","p_value"]).reset_index(drop = True)
type q_value p_value sigma
0 type1 5.10 0.71 2.88
1 type2 3.10 0.62 2.72
2 type3 0.85 0.44 2.41
3 type3 0.90 0.53 2.91
4 type3 1.30 0.54 2.79
5 type3 1.60 0.71 2.73
6 type4 0.50 0.33 2.69
7 type4 0.70 0.33 2.70
8 type4 0.70 0.50 2.63
9 type4 0.70 0.54 2.44
10 type5 1.10 0.28 2.40
11 type5 1.10 0.31 2.67
12 type5 1.20 0.32 2.59
13 type6 0.40 0.16 2.35
14 type6 0.40 0.31 2.67
table_df:
type: a fully restrictive condition
q_value & p_value: if there is no exactly matching value for q_a or p_a, just use the next bigger value and assign the corresponding sigma to column sigma_a in base_df; if there is no bigger one, use the previous value in the reference table.
define the function for _a and _b (yeah, they are the same):
find_sigma_a and find_sigma_b
def find_sigma_a(row):
    sigma_value = table_df[
        (table_df["type"] == row["type_a"]) &
        (table_df["q_value"] >= row["q_a"]) &
        (table_df["p_value"] >= row["p_a"])
    ]
    if row["type_a"] == 'nan':
        sigma_value = 0
    elif len(sigma_value) == 0:
        sigma_value = table_df[table_df["type"] == row["type_a"]].iloc[-1, 3]
        # .iloc[-1, 3] is equivalent to ["sigma"].tail(1).values[0]
    else:
        sigma_value = sigma_value.iloc[0, 3]
        # .iloc[0, 3] is equivalent to ["sigma"].head(1).values[0]
    return sigma_value

def find_sigma_b(row):
    sigma_value = table_df[
        (table_df["type"] == row["type_b"]) &
        (table_df["q_value"] >= row["q_b"]) &
        (table_df["p_value"] >= row["p_b"])
    ]
    if row["type_b"] == 'nan':
        sigma_value = 0
    elif len(sigma_value) == 0:
        sigma_value = table_df[table_df["type"] == row["type_b"]].iloc[-1, 3]
        # .iloc[-1, 3] is equivalent to ["sigma"].tail(1).values[0]
    else:
        sigma_value = sigma_value.iloc[0, 3]
        # .iloc[0, 3] is equivalent to ["sigma"].head(1).values[0]
    return sigma_value
and then use pandas.DataFrame.apply to apply these two functions
base_df["sigma_a"] = base_df.apply(find_sigma_a, axis = 1)
base_df["sigma_b"] = base_df.apply(find_sigma_b, axis = 1)
type_a q_a p_a type_b q_b p_b sigma_a sigma_b
0 nan 0.0 0.00 type6 0.4 0.11 0.00 2.35
1 type3 0.9 0.53 type3 1.4 0.60 2.91 2.73
2 type1 5.1 0.71 type3 0.9 0.53 2.88 2.91
3 type2 3.0 0.60 type6 0.5 0.40 2.72 2.67
4 type3 1.6 0.53 type6 0.4 0.11 2.73 2.35
5 type5 1.1 0.30 type1 4.9 0.70 2.67 2.88
6 type4 0.7 0.33 type4 0.7 0.20 2.70 2.70
arrange the columns:
base_df.iloc[:,[0,1,2,6,3,4,5,7]]
type_a q_a p_a sigma_a type_b q_b p_b sigma_b
0 nan 0.0 0.00 0.00 type6 0.4 0.11 2.35
1 type3 0.9 0.53 2.91 type3 1.4 0.60 2.73
2 type1 5.1 0.71 2.88 type3 0.9 0.53 2.91
3 type2 3.0 0.60 2.72 type6 0.5 0.40 2.67
4 type3 1.6 0.53 2.73 type6 0.4 0.11 2.35
5 type5 1.1 0.30 2.67 type1 4.9 0.70 2.88
6 type4 0.7 0.33 2.70 type4 0.7 0.20 2.70
I have a data set with the following form:
A B C D E
0 0.5 0.2 0.25 0.75 1.25
1 0.5 0.3 0.12 0.41 1.40
2 0.5 0.4 0.85 0.15 1.55
3 1.0 0.2 0.11 0.15 1.25
4 1.0 0.3 0.10 0.11 1.40
5 1.0 0.4 0.87 0.14 1.25
6 2.0 0.2 0.23 0.45 1.55
7 2.0 0.3 0.74 0.85 1.25
8 2.0 0.4 0.55 0.55 1.40
Here is code to generate this DataFrame with pandas:
import pandas as pd
data = [[0.5, 0.2, 0.25, 0.75, 1.25],
[0.5, 0.3, 0.12, 0.41, 1.40],
[0.5, 0.4, 0.85, 0.15, 1.55],
[1.0, 0.2, 0.11, 0.15, 1.25],
[1.0, 0.3, 0.10, 0.11, 1.40],
[1.0, 0.4, 0.87, 0.14, 1.25],
[2.0, 0.2, 0.23, 0.45, 1.55],
[2.0, 0.3, 0.74, 0.85, 1.25],
[2.0, 0.4, 0.55, 0.55, 1.40]]
df = pd.DataFrame(data,columns=['A','B','C','D','E'])
This data represents the outcome of an experiment where for each A, B and E there is a unique value of C.
What I want is to perform a linear interpolation so that I get similar data for A = 0.7, for instance, based on the values at A = 0.5 and A = 1.0.
the expected output should be something like :
A B C D E
0 0.5 0.2 0.25 0.75 1.25
1 0.5 0.3 0.12 0.41 1.40
2 0.5 0.4 0.85 0.15 1.55
3 0.7 0.2 xxx xxx 1.25
4 0.7 0.3 xxx xxx 1.40
5 0.7 0.4 xxx xxx 1.55
6 1.0 0.2 0.11 0.15 1.25
7 1.0 0.3 0.10 0.11 1.40
8 1.0 0.4 0.87 0.14 1.25
9 2.0 0.2 0.23 0.45 1.55
10 2.0 0.3 0.74 0.85 1.25
11 2.0 0.4 0.55 0.55 1.40
Is there a straightforward way to do that in Python? I tried using pandas' interpolate, but the values I got didn't make sense.
Any suggestions?
Here is an example of how to create an interpolation function mapping values from column A to values from column C (arbitrarily picking 0.5 to 2.0 for values of A):
import pandas as pd
import numpy as np
from scipy import interpolate
# Set up the dataframe
data = [[0.5, 0.2, 0.25, 0.75, 1.25],
[0.5, 0.3, 0.12, 0.41, 1.40],
[0.5, 0.4, 0.85, 0.15, 1.55],
[1.0, 0.2, 0.11, 0.15, 1.25],
[1.0, 0.3, 0.10, 0.11, 1.40],
[1.0, 0.4, 0.87, 0.14, 1.25],
[2.0, 0.2, 0.23, 0.45, 1.55],
[2.0, 0.3, 0.74, 0.85, 1.25],
[2.0, 0.4, 0.55, 0.55, 1.40]]
df = pd.DataFrame(data,columns=['A','B','C','D','E'])
# Create the interpolation function
f = interpolate.interp1d(df['A'], df['C'])
# Evaluate new A (x) values to get new C (y) values via interpolation
xnew = np.linspace(0.5, 2.0, 10)
ynew = f(xnew)
print("%-7s %-7s"%("A","C"))
print("-"*16)
for x, y in zip(xnew, ynew):
print("%0.4f\t%0.4f"%(x,y))
The result:
A C
----------------
0.5000 0.8500
0.6667 0.6033
0.8333 0.3567
1.0000 0.8700
1.1667 0.7633
1.3333 0.6567
1.5000 0.5500
1.6667 0.4433
1.8333 0.3367
2.0000 0.5500
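Since the expected output keeps each B value as its own series over A, one possible sketch (my reading of the intent, not part of the original answer) interpolates within each B group using np.interp and appends the new rows:

import numpy as np
import pandas as pd

new_a = 0.7
rows = []
for b, grp in df.sort_values('A').groupby('B'):
    # linearly interpolate each remaining column at A=new_a within this B group
    rows.append({'A': new_a, 'B': b,
                 'C': np.interp(new_a, grp['A'], grp['C']),
                 'D': np.interp(new_a, grp['A'], grp['D']),
                 'E': np.interp(new_a, grp['A'], grp['E'])})
out = (pd.concat([df, pd.DataFrame(rows)], ignore_index=True)
         .sort_values(['A', 'B'], ignore_index=True))
print(out)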
I have a data file containing 7500 lines like:
Y1C 1.53 -0.06 0.58 0.52 0.42 0.16 0.79 -0.6 -0.3
-0.78 -0.14 0.38 0.34 0.23 0.26 -1.8 -0.1 -0.17 0.3
0.6 0.9 0.71 0.5 0.49 1.06 0.25 0.96 -0.39 0.24 0.69
0.41 0.7 -0.16 -0.39 0.6 1.04 0.4 -0.04 0.36 0.23 -0.14
-0.09 0.15 -0.46 -0.05 0.32 -0.54 -0.28 -0.15 1.34 0.29
0.59 -0.43 -0.55 -0.18 -0.01 0.68 -0.06 -0.11 -0.67
-0.25 -0.34 -0.38 0.02 -0.21 0.12 0.01 0.07 0.15 0.14
0.15 -0.11 0.07 -0.41 -0.2 0.24 0.06 0.12 0.12 0.11
0.1 0.24 -0.71 0.22 -0.02 0.15 0.84 1.39 0.13 0.48
0.19 -0.23 -0.12 0.33 0.37 0.18 0.06 0.32 0.09
-0.09 0.02 -0.01 -0.06 -0.23 0.52 0.14 0.24 -0.05 0.37
0.1 0.45 0.38 1.34 0.74 0.5 0.92 0.91 1.34 1.78 2.26
0.05 0.29 0.53 0.17 0.41 0.47 0.47 1.21 0.87 0.68
1.08 0.89 0.13 0.5 0.57 -0.5 -0.78 -0.34 -0.3 0.54
0.31 0.64 1.23 0.335 0.36 -0.65 0.39 0.39 0.31 0.73
0.54 0.3 0.26 0.47 0.13 0.24 -0.6 0.02 0.11 0.27
0.21 -0.3 -1 -0.44 -0.15 -0.51 0.3 0.14 -0.15 -0.27 -0.27
Y2W -0.01 -0.3 0.23 0.01 -0.15 0.45 -0.04 0.14 -1.16
-0.14 -0.56 -0.13 0.77 0.77 -0.57 0.48 0.22 -0.08
-1.81 -0.46 -0.17 0.2 -0.18 -0.45 -0.4 1.35 0.81 1.21
0.52 0.02 -0.06 0.37 0 -0.38 -0.02 0.48 0 0.58 0.81
0.54 0.18 -0.11 0.03 0.1 -0.38 0.17 0.37 -0.05 0.13
-0.01 -0.17 0.36 0.22 0 -1.4 -0.67 -0.45 -0.62 -0.58
-0.47 -0.86 -1.12 -0.43 0.1 0.06 -0.45 -0.14 0.68 -0.16
0.14 0.14 0.18 0.14 0.17 0.13 0.07 0.05 0.04 0.07
-0.01 0.03 0.05 0.02 0.12 0.34 -0.04 -0.75 1.68 0.23
0.49 0.38 -0.57 0.17 -0.04 0.19 0.22 0.29 -0.04 -0.3
0.18 0.04 0.3 -0.06 -0.07 -0.21 -0.01 0.51 -0.04 -0.04
-0.23 0.06 0.9 -0.14 0.19 2.5 2.84 3.27 2.13 2.5 2.66
4.16 3.52 -0.12 0.13 0.44 0.32 0.44 0.46 0.7 0.68
0.99 0.83 0.74 0.51 0.33 0.22 0.01 0.33 -0.19 0.4
0.41 0.07 0.18 -0.01 0.45 -0.37 -0.49 1.02 -0.59
-1.44 -1.53 -0.74 -1.48 0.12 0.05 0.02 -0.1 0.57
-0.36 0.1 -0.16 -0.23 -0.34 -0.61 -0.37 -0.14 -0.22 -0.27
-0.08 -0.08 -0.17 0.18 -0.74
Y3W 0.15 -0.07 -0.25 -0.3 -1.12 -0.67 -0.15 -0.43 0.63
0.92 0.25 0.33 0.81 -0.12 -0.12 0.67 0.86 0.86
1.54 -0.3 0 -0.29 -0.74 0.15 0.59 0.15 0.34 0.23
0.5 0.52 0.25 0.86 0.53 0.51 0.25 -1.29 -1.79
-0.45 -0.64 0.01 -0.58 -0.51 -0.74 -1.32 -0.47
-0.81 0.55 -0.09 0.46 -0.3 -0.2 -0.81 -1.56 -2.74 1.03
1 1.01 0.29 -0.64 -1.03 0.07 0.46 0.33 0.04 -0.6
-0.64 -0.51 -0.36 -0.1 0.13 -1.4 -1.17 -0.64 -0.16 -0.5
-0.47 0.75 0.62 0.7 1.06 0.93 0.56 -2.25 -0.94 -0.09
0.08 -0.15 -1.6 -1.43 -0.84 -0.25 -1.22 -0.92 -1.22
-0.97 -0.84 -0.89 0.24 0 -0.04 -0.64 -0.94 -1.56 -2.32
0.63 -0.17 -3.06 -2.4 -2 -1.4 -0.81 -1.6 -3.06 -1.79
0.17 0.28 -0.67 -2.82 -1.47 -1.82 -1.69 -1.38 -1.96
-1.88 -2.34 -3.06 -0.18 0.5 -0.03 -0.49 -0.61 -0.54 -0.37
0.1 -0.92 -1.79 -0.03 -0.54 0.94 -1 0.15 0.95 0.55
-0.36 0.4 -0.73 0.85 -0.26 0.55 0.14 -0.36 0.38 0.87
0.62 0.66 0.79 -0.67 0.48 0.62 0.48 0.72 0.73 0.29
-0.3 -0.81
Y4W 0.24 0.76 0.2 0.34 0.11 0.07 0.01 0.36 0.4 -0.25
-0.45 0.09 -0.97 0.19 0.28 -1.81 -0.64 -0.49 -1.27
-0.95 -0.1 0.12 -0.1 0 -0.08 0.77 1.02 0.92 0.56
0.1 0.7 0.57 0.16 1.29 0.82 0.55 0.5 1.83 1.79 0.01
0.24 -0.67 -0.85 -0.42 -0.37 0.2 0.07 -0.01 -0.17 -0.2
-0.43 -0.34 0.12 -0.21 -0.23 -0.22 -0.1 -0.07 -0.61
-0.14 -0.43 -0.97 0.27 0.7 0.54 0.11 -0.5 -0.39 0.01
0.61 0.88 1 0.35 0.67 0.6 0.78 0.46 0.09 -0.06
-0.35 0.08 -0.14 -0.32 -0.11 0 0.01 0.02 0.77 0.18
0.36 -1.15 -0.42 -0.19 0.06 -0.25 -0.81 -0.63 -1.43
-0.85 -0.88 -0.68 -0.59 -1.01 -0.68 -0.71 0.15 0.08 0.08
-0.03 -0.2 0.03 -0.18 -0.01 -0.08 -1.61 -0.67 -0.74
-0.54 -0.8 -1.02 -0.84 -1.91 -0.22 -0.02 0.05 -0.84
-0.65 -0.82 -0.4 -0.83 -0.9 -1.04 -1.23 -0.91 0.28 0.68
0.57 -0.02 0.4 -1.52 0.17 0.44 -1.18 0.04 0.17 0.16
0.04 -0.26 0.04 0.1 -0.11 -0.64 -0.09 -0.16 0.16 -0.05
0.39 0.39 -0.06 0.46 0.2 -0.45 -0.35 -1.3 -0.26 -0.29
0.02 0.16 0.18 -0.35 -0.45 -1.04 -0.69
Y5C 2.85 3.34 -1 -0.47 -0.66 -0.03 1.41
0.8 0 0.41 -0.14 -0.86 -0.79 -1.69 0 0 1.52
1.29 0.84 0.58 1.02 1.35 0.45 1.02 1.47 0.82 0.46
0.25 0.77 0.93 -0.58 -0.67 -0.18 -0.56 -0.01 0.25
-0.71 -0.49 -0.43 0 -1.06 0.44 -0.29 0.26 -0.04
-0.14 -0.1 -0.12 -1.6 0.33 0.62 0.52 0.7 -0.22 0.44
-0.6 0.86 1.19 1.58 0.93 1 0.85 1.24 1.06 0.49
0.26 0.18 0.3 -0.09 -0.42 0.05 0.54 0.24 0.37 0.86
0.9 0.49 -1.47 -0.2 -0.43 0.2 0.1 -0.81 -0.74 -1.36 -0.97
-0.94 -0.86 -1.56 -1.89 -1.89 -1.06 0.12 0.06 0.04
-0.01 -0.12 0.01 -0.15 0.76 0.89 0.71 -1.12 0.03
-0.86 0.26 0 -0.25 -0.06 0.19 0.41 0.58 -0.46 0.01
-0.15 0.04 -1.01 -0.57 -0.71 -0.3 -1.01 1.83 0.59
1.04 -1.43 0.38 0.65 -6.64 -0.42 0.24 0.46 0.96 0.24
0.7 1.21 0.6 0.12 0.77 -0.03 0.53 0.31 0.46 0.51
-0.45 0.23 0.32 -0.34 -0.1 0.1 -0.45 0.74 -0.06 0.21
0.29 0.45 0.68 0.29 0.45
Y7C -0.22 -0.12 -0.29 -0.51 -0.81 -0.47 0.28 -0.1 0.15
0.38 0.18 -0.27 0.12 -0.15 0.43 0.25 0.19 0.33 0.67
0.86 -0.56 -0.29 -0.36 -0.42 0.08 0.04 -0.04 0.15 0.38
-0.07 -0.1 -0.2 -0.03 -0.29 0.06 0.65 0.58 0.86 2.05
0.3 0.33 -0.29 -0.23 -0.15 -0.32 0.08 0.34 0.15 0
-0.01 0.28 0.36 0.25 0.46 0.4 0.7 0.49 0.97 1.04
0.36 -0.47 -0.29 0.77 0.57 0.45 0.77 0.24 -0.23 0.12
0.49 0.62 0.49 0.84 0.89 1.08 0.87 -0.18 -0.43
-0.39 -0.18 -0.02 0.01 0.2 -0.2 -0.03 0.01 0.25 0.1
-0.07 -1.43 -0.2 -0.4 0.32 0.72 -0.42 -0.3 -0.38
-0.22 -0.81 -1.15 -1.6 -1.89 -2.06 -2.4 0.08 0.34 0.1
-0.15 -0.06 -0.17 -0.47 -0.4 0.15 -1.22 -1.43 -1.03
-1.03 -1.64 -1.84 -2.64 -2 0.05 0.4 0.88 -1.54 -1.21
-1.46 -1.92 -1.52 -1.92 -1.7 -1.94 -1.86 -0.1 -0.02
-0.22 -0.34 -0.48 0.28 0 0.14 0.4 -0.29 -0.27 -0.3
-0.67 -0.09 0.23 0.33 0.23 0.1 0.38 -0.51 0.23 -0.73
0.22 -0.47 0.24 0.68 0.53 0.23 -0.1 0.11 -0.18 0.16
0.68 0.55 0.28 -0.03 0.03 0.08 0.12
There are missing values. I wanted to load it as a matrix, so I used:
data = np.genfromtxt("This_data.txt", delimiter='\t', missing_values=np.nan)
When I run it, I get:
Traceback (most recent call last):
File "matrix.py", line 8, in <module>
data = np.genfromtxt("This_data.txt", delimiter='\t', missing_values=np.nan ,usecols=np.arange(0,174))
File "/home/anaconda2/lib/python2.7/numpy/lib/npyio.py", line 1769, in genfromtxt
raise ValueError(errmsg)
ValueError: Some errors were detected !
Line #25 (got 172 columns instead of 174)
I also tried:
data = np.genfromtxt("This_data.txt", delimiter='\t', missing_values=np.nan ,usecols=np.arange(0,174))
But I get the same error. Any suggestions?
A short sample bytestring substitute for a file:
In [168]: txt = b"""Y7C\t-0.22\t-0.12\t-0.29\t-0.51\t-0.81
...: Y7C\t-0.22\t-0.12\t-0.29\t-0.51\t-0.81
...: Y7C\t-0.22\t-0.12\t-0.29\t-0.51\t-0.81
...: """
A minimal load with the correct delimiter. Note the first column is NaN, because the strings can't be converted to float:
In [169]: np.genfromtxt(txt.splitlines(),delimiter='\t')
Out[169]:
array([[ nan, -0.22, -0.12, -0.29, -0.51, -0.81],
[ nan, -0.22, -0.12, -0.29, -0.51, -0.81],
[ nan, -0.22, -0.12, -0.29, -0.51, -0.81]])
with dtype=None it sets each column dtype automatically, creating a structured array:
In [170]: np.genfromtxt(txt.splitlines(),delimiter='\t',dtype=None)
Out[170]:
array([(b'Y7C', -0.22, -0.12, -0.29, -0.51, -0.81),
(b'Y7C', -0.22, -0.12, -0.29, -0.51, -0.81),
(b'Y7C', -0.22, -0.12, -0.29, -0.51, -0.81)],
dtype=[('f0', 'S3'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8'), ('f5', '<f8')])
Spell out the columns to use, skipping the first:
In [172]: np.genfromtxt(txt.splitlines(),delimiter='\t',usecols=np.arange(1,6))
Out[172]:
array([[-0.22, -0.12, -0.29, -0.51, -0.81],
[-0.22, -0.12, -0.29, -0.51, -0.81],
[-0.22, -0.12, -0.29, -0.51, -0.81]])
But if I ask for more columns than it finds, I get an error like yours:
In [173]: np.genfromtxt(txt.splitlines(),delimiter='\t',usecols=np.arange(1,7))
---------------------------------------------------------------------------
....
ValueError: Some errors were detected !
Line #1 (got 6 columns instead of 6)
Line #2 (got 6 columns instead of 6)
Line #3 (got 6 columns instead of 6)
Your missing_values parameter doesn't help; that's not what it is for.
The correct use of missing_values is to detect a specific string value and replace it with a valid float:
In [177]: np.genfromtxt(txt.splitlines(),delimiter='\t',missing_values='Y7C',filling_values=0)
Out[177]:
array([[ 0. , -0.22, -0.12, -0.29, -0.51, -0.81],
[ 0. , -0.22, -0.12, -0.29, -0.51, -0.81],
[ 0. , -0.22, -0.12, -0.29, -0.51, -0.81]])
If the file has sufficient delimiters, it can treat those as missing values
In [178]: txt = b"""Y7C\t-0.22\t-0.12\t-0.29\t-0.51\t-0.81\t\t
...: Y7C\t-0.22\t-0.12\t-0.29\t-0.51\t-0.81\t\t
...: Y7C\t-0.22\t-0.12\t-0.29\t-0.51\t-0.81\t\t
...: """
In [179]: np.genfromtxt(txt.splitlines(),delimiter='\t')
Out[179]:
array([[ nan, -0.22, -0.12, -0.29, -0.51, -0.81, nan, nan],
[ nan, -0.22, -0.12, -0.29, -0.51, -0.81, nan, nan],
[ nan, -0.22, -0.12, -0.29, -0.51, -0.81, nan, nan]])
In [180]: np.genfromtxt(txt.splitlines(),delimiter='\t',filling_values=0)
Out[180]:
array([[ 0. , -0.22, -0.12, -0.29, -0.51, -0.81, 0. , 0. ],
[ 0. , -0.22, -0.12, -0.29, -0.51, -0.81, 0. , 0. ],
[ 0. , -0.22, -0.12, -0.29, -0.51, -0.81, 0. , 0. ]])
I believe the pandas csv reader can handle 'ragged' columns and missing values better.
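For example, a sketch of that (an untested suggestion, with the 174-column width taken from the question): supplying more column names than any row has fields makes the pandas parser pad short rows with NaN:

import pandas as pd

# rows shorter than 174 fields are padded with NaN instead of raising an error
df = pd.read_csv("This_data.txt", sep='\t', header=None, names=range(174))
print(df.shape)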
Evidently the program does not like the fact that you have missing values, probably because you're generating a matrix, so it doesn't like replacing missing values with NaNs. Try adding 0s in the places with missing values, or at least the tab delimiters, so that every line registers as having all 174 columns.
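A sketch of that preprocessing (assuming 174 tab-separated fields per line; the output file name here is made up):

# pad every line of the raw file with tabs so each row has 174 fields
with open("This_data.txt") as src, open("This_data_padded.txt", "w") as dst:
    for line in src:
        fields = line.rstrip("\n").split("\t")
        fields += [""] * (174 - len(fields))  # empty fields register as missing values
        dst.write("\t".join(fields) + "\n")

np.genfromtxt("This_data_padded.txt", delimiter='\t', filling_values=0) should then load it without the column-count error.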