I am new to working with Python and have the following problem with my calculations.
I have a table which contains NaN values.
The NaN values always occur at night, because no solar radiation can be measured there.
I want to replace all NaN values from a night with the value 4 hours before sunset.
I already tried to use the ffill method, but since I don't need the last value before the NaN values, it doesn't work unfortunately (see the short illustration after the example below).
For example:
a=[0.88, 0.84, 0.26, 0.50, 1.17, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, 0.73, 0.81]
The successive NaN values should all have the value 0.84.
The list should therefore look like this:
a=[0.88, 0.84, 0.26, 0.50, 1.17, 0.84, 0.84, 0.84, 0.84, 0.84, 0.84, 0.84, 0.84, 0.73, 0.81]
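For illustration (a small sketch of the attempt mentioned above): a plain ffill simply propagates the last value before the gap (1.17), which is not what is wanted here.
import numpy as np
import pandas as pd

s = pd.Series([0.88, 0.84, 0.26, 0.50, 1.17, np.nan, np.nan, np.nan, np.nan,
               np.nan, np.nan, np.nan, np.nan, 0.73, 0.81])
print(s.ffill().tolist())
# [0.88, 0.84, 0.26, 0.5, 1.17, 1.17, 1.17, 1.17, 1.17, 1.17, 1.17, 1.17, 1.17, 0.73, 0.81]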
Thanks in advance.
One option is to create a shifted and forward-filled version of the original series and then use that to fill in the nulls of the original data.
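Here s is assumed to hold the example data as a pandas Series (as constructed in the sketch above).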
In [231]: s.fillna(s.shift(3).mask(s.isnull()).ffill())
Out[231]:
0 0.88
1 0.84
2 0.26
3 0.50
4 1.17
5 0.84
6 0.84
7 0.84
8 0.84
9 0.84
10 0.84
11 0.84
12 0.84
13 0.73
14 0.81
dtype: float64
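In other words: s.shift(3) moves every reading three positions forward, .mask(s.isnull()) keeps the shifted values only where the original series has data (so the last surviving value sits right before the gap and carries the reading from four steps before the first NaN, 0.84), .ffill() then propagates that value across the gap, and the outer fillna uses it only at the positions that were NaN in the original series.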
import pandas as pd
a = [0.88, 0.84, 0.26, 0.50, 1.17, None, None, None, None, None, None, None, None, 0.73, 0.81]
df = pd.DataFrame(a)
# Fill NaNs from row 3 onward with the value at row 1 (four rows before the first NaN)
df[3:] = df[3:].fillna(value=df.iloc[1, 0])
print(df)
0
0 0.88
1 0.84
2 0.26
3 0.50
4 1.17
5 0.84
6 0.84
7 0.84
8 0.84
9 0.84
10 0.84
11 0.84
12 0.84
13 0.73
14 0.81
import numpy as np

a = [0.88, 0.84, 0.26, 0.50, 1.17, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 0.73, 0.81]
a = np.array(a)
# Replace every NaN with the value at index 1 (four positions before the first NaN)
a[np.isnan(a)] = a[1]
a
Result:
array([0.88, 0.84, 0.26, 0.5 , 1.17, 0.84, 0.84, 0.84, 0.84, 0.84, 0.84,
0.84, 0.84, 0.73, 0.81])
I have the following pandas dataframe:
df = pd.DataFrame({'pred': [1, 2, 3, 4],
'a': [0.4, 0.6, 0.35, 0.5],
'b': [0.2, 0.4, 0.32, 0.1],
'c': [0.1, 0, 0.2, 0.2],
'd': [0.1, 0, 0.3, 0.2]})
I want to change the values in the 'pred' column, based on columns a, b, c, d, as follows:
if the value in column a is larger than the values in columns b, c and d,
and
if at least one of the columns b, c or d has a value larger than 0.25,
then change the value in 'pred' to 0. So the result should be:
pred a b c d
0 1 0.4 0.2 0.1 0.1
1 0 0.6 0.4 0.0 0.0
2 0 0.35 0.32 0.2 0.3
3 4 0.5 0.1 0.2 0.2
How can I do this?
Create a boolean condition/mask, then use loc to set the value to 0 where the condition is True:
cols = ['b', 'c', 'd']
mask = df[cols].lt(df['a'], axis=0).all(1) & df[cols].gt(.25).any(1)
df.loc[mask, 'pred'] = 0
pred a b c d
0 1 0.40 0.20 0.1 0.1
1 0 0.60 0.40 0.0 0.0
2 0 0.35 0.32 0.2 0.3
3 4 0.50 0.10 0.2 0.2
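Here df[cols].lt(df['a'], axis=0).all(1) checks that b, c and d are all smaller than a, and df[cols].gt(.25).any(1) checks that at least one of them exceeds 0.25; rows where both conditions hold get pred set to 0.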
import pandas as pd

def row_cond(row):
    # Largest value among columns b, c and d
    m_val = max(row['b'], row['c'], row['d'])
    # If a beats all of b, c, d and that maximum exceeds 0.25, zero out pred
    if row['a'] > m_val and m_val > 0.25:
        row['pred'] = 0
    return row

df = pd.DataFrame({'pred': [1, 2, 3, 4],
                   'a': [0.4, 0.6, 0.35, 0.5],
                   'b': [0.2, 0.4, 0.32, 0.1],
                   'c': [0.1, 0, 0.2, 0.2],
                   'd': [0.1, 0, 0.3, 0.2]})
new_df = df.apply(row_cond, axis=1)
Output:
pred a b c d
0 1.0 0.40 0.20 0.1 0.1
1 0.0 0.60 0.40 0.0 0.0
2 0.0 0.35 0.32 0.2 0.3
3 4.0 0.50 0.10 0.2 0.2
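Note that apply(..., axis=1) passes each row as a single float Series, so the integer pred column comes back as float; if needed it can be cast back with new_df['pred'] = new_df['pred'].astype(int).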
Consider the following data frames:
base_df = pd.DataFrame({
'id': [1, 2, 3, 4, 5, 6, 7],
'type_a': ['nan', 'type3', 'type1', 'type2', 'type3', 'type5', 'type4'],
'q_a': [0, 0.9, 5.1, 3.0, 1.6, 1.1, 0.7],
'p_a': [0, 0.53, 0.71, 0.6, 0.53, 0.3, 0.33]
})
Edit: This is an extract of base_df. The original df has 100 columns with around 500 observations.
table_df = pd.DataFrame({
'type': ['type1', 'type2', 'type3', 'type3', 'type3', 'type3', 'type4', 'type4', 'type4', 'type4', 'type5', 'type5', 'type5', 'type6', 'type6'],
'q_value': [5.1, 3.1, 1.6, 1.3, 0.9, 0.85, 0.7, 0.7, 0.7, 0.5, 1.2, 1.1, 1.1, 0.4, 0.4],
'p_value': [0.71, 0.62, 0.71, 0.54, 0.53, 0.44, 0.5, 0.54, 0.33, 0.33, 0.32, 0.31, 0.28, 0.31, 0.16],
'sigma':[2.88, 2.72, 2.73, 2.79, 2.91, 2.41, 2.63, 2.44, 2.7, 2.69, 2.59, 2.67, 2.4, 2.67, 2.35]
})
Edit: The original table_df looks exactly like this one.
For every observation in base_df, I'd like to check whether the type matches an entry in table_df. If it does:
I'd like to check whether there is an entry in table_df with the corresponding value q_a == q_value. If there is:
and there is only one such q_value, assign its sigma to base_df.
If there is more than one matching q_value, compare p_a and assign the correct sigma to base_df.
If there is no exactly matching value for q_a or p_a, just use the next bigger value; if there is no bigger value, use the lower one, and assign the corresponding sigma to column sigma_a in base_df.
The resulting DF should look like this:
id type_a q_a p_a sigma_a
1 nan 0 0 0
2 type3 0.9 0.53 2.91
3 type1 5.1 0.71 2.88
4 type2 3 0.6 2.72
5 type3 1.6 0.53 2.41
6 type5 1.1 0.3 2.67
7 type4 0.7 0.33 2.7
So far I use the code below:
mapping = (pd.merge_asof(base_df.sort_values('q_a'),
                         table_df.sort_values('q_value'),
                         left_on='q_a',
                         left_by='type_a',
                         right_on='q_value',
                         right_by='type').set_index('id'))

base_df = (pd.merge_asof(base_df.sort_values('q_a'),
                         table_df.sort_values('q_value'),
                         left_on='q_a',
                         left_by='type_a',
                         right_on='q_value',
                         right_by='type',
                         direction='forward')
           .set_index('id')
           .combine_first(mapping)
           .sort_index()
           .reset_index()
           )
This "two-step check routine" works, but I'd like to add the third step, checking p_value.
How can I implement it?
Actually, I think the metrics should not be separated into an A segment and a B segment;
they should be concatenated into the same column, with something like a Segment column to distinguish them.
Anyway, according to your description,
table_df is a reference table and the same criteria apply to _a and _b,
therefore I order it hierarchically with the following manipulation:
table_df.sort_values(by=["type","q_value","p_value"]).reset_index(drop = True)
type q_value p_value sigma
0 type1 5.10 0.71 2.88
1 type2 3.10 0.62 2.72
2 type3 0.85 0.44 2.41
3 type3 0.90 0.53 2.91
4 type3 1.30 0.54 2.79
5 type3 1.60 0.71 2.73
6 type4 0.50 0.33 2.69
7 type4 0.70 0.33 2.70
8 type4 0.70 0.50 2.63
9 type4 0.70 0.54 2.44
10 type5 1.10 0.28 2.40
11 type5 1.10 0.31 2.67
12 type5 1.20 0.32 2.59
13 type6 0.40 0.16 2.35
14 type6 0.40 0.31 2.67
table_df lookup rules:
type: a strict matching condition.
q_value & p_value: if there is no exactly matching value for q_a or p_a, just use the next bigger value and assign the corresponding sigma to column sigma_a in base_df; if there is no bigger one, use the previous value in the reference table.
Define the lookup functions for _a and _b (they are identical apart from the column suffix):
find_sigma_a and find_sigma_b
def find_sigma_a(row):
    sigma_value = table_df[
        (table_df["type"] == row["type_a"]) &
        (table_df["q_value"] >= row["q_a"]) &
        (table_df["p_value"] >= row["p_a"])
    ]
    if row["type_a"] == 'nan':
        sigma_value = 0
    elif len(sigma_value) == 0:
        sigma_value = table_df[table_df["type"] == row["type_a"]].iloc[-1, 3]
        # .iloc[-1, 3] alternatively ["sigma"].tail(1).values[0]
    else:
        sigma_value = sigma_value.iloc[0, 3]
        # .iloc[0, 3] alternatively ["sigma"].head(1).values[0]
    return sigma_value
def find_sigma_b(row):
    sigma_value = table_df[
        (table_df["type"] == row["type_b"]) &
        (table_df["q_value"] >= row["q_b"]) &
        (table_df["p_value"] >= row["p_b"])
    ]
    if row["type_b"] == 'nan':
        sigma_value = 0
    elif len(sigma_value) == 0:
        sigma_value = table_df[table_df["type"] == row["type_b"]].iloc[-1, 3]
        # .iloc[-1, 3] alternatively ["sigma"].tail(1).values[0]
    else:
        sigma_value = sigma_value.iloc[0, 3]
        # .iloc[0, 3] alternatively ["sigma"].head(1).values[0]
    return sigma_value
Then use pandas.DataFrame.apply to apply these two functions:
base_df["sigma_a"] = base_df.apply(find_sigma_a, axis = 1)
base_df["sigma_b"] = base_df.apply(find_sigma_b, axis = 1)
type_a q_a p_a type_b q_b p_b sigma_a sigma_b
0 nan 0.0 0.00 type6 0.4 0.11 0.00 2.35
1 type3 0.9 0.53 type3 1.4 0.60 2.91 2.73
2 type1 5.1 0.71 type3 0.9 0.53 2.88 2.91
3 type2 3.0 0.60 type6 0.5 0.40 2.72 2.67
4 type3 1.6 0.53 type6 0.4 0.11 2.73 2.35
5 type5 1.1 0.30 type1 4.9 0.70 2.67 2.88
6 type4 0.7 0.33 type4 0.7 0.20 2.70 2.70
Rearrange the columns:
base_df.iloc[:,[0,1,2,6,3,4,5,7]]
type_a q_a p_a sigma_a type_b q_b p_b sigma_b
0 nan 0.0 0.00 0.00 type6 0.4 0.11 2.35
1 type3 0.9 0.53 2.91 type3 1.4 0.60 2.73
2 type1 5.1 0.71 2.88 type3 0.9 0.53 2.91
3 type2 3.0 0.60 2.72 type6 0.5 0.40 2.67
4 type3 1.6 0.53 2.73 type6 0.4 0.11 2.35
5 type5 1.1 0.30 2.67 type1 4.9 0.70 2.88
6 type4 0.7 0.33 2.70 type4 0.7 0.20 2.70
I have some DataFrame:
df = pd.DataFrame({'name': ['apple1', 'apple2', 'apple3', 'apple4', 'orange1', 'orange2', 'orange3', 'orange4'],
'A': [0, 0, 0, 0, 0, 0 ,0, 0],
'B': [0.10, -0.15, 0.25, -0.55, 0.50, -0.51, 0.70, 0],
'C': [0, 0, 0.25, -0.55, 0.50, -0.51, 0.70, 0.90],
'D': [0.10, -0.15, 0.25, 0, 0.50, -0.51, 0.70, 0.90]})
df
name A B C D
0 apple1 0 0.10 0.00 0.10
1 apple2 0 -0.15 0.00 -0.15
2 apple3 0 0.25 0.25 0.25
3 apple4 0 -0.55 -0.55 0.00
4 orange1 0 0.50 0.50 0.50
5 orange2 0 -0.51 -0.51 -0.51
6 orange3 0 0.70 0.70 0.70
7 orange4 0 0.00 0.90 0.90
I'd like to drop all rows that have two or more zeros in columns A,B,C,D.
This DataFrame has other columns that have zeros; I only want to check for zeros in columns A,B,C,D.
You can use .eq to check where the dataframe is equal to 0, then take the sum on axis=1 and build a boolean series by checking whether the sum is greater than or equal to 2 (ge); negating that mask keeps the remaining rows:
df[~df[['A','B','C','D']].eq(0).sum(1).ge(2)]
name A B C D
2 apple3 0 0.25 0.25 0.25
4 orange1 0 0.50 0.50 0.50
5 orange2 0 -0.51 -0.51 -0.51
6 orange3 0 0.70 0.70 0.70
I have a data set with the following form:
A B C D E
0 0.5 0.2 0.25 0.75 1.25
1 0.5 0.3 0.12 0.41 1.40
2 0.5 0.4 0.85 0.15 1.55
3 1.0 0.2 0.11 0.15 1.25
4 1.0 0.3 0.10 0.11 1.40
5 1.0 0.4 0.87 0.14 1.25
6 2.0 0.2 0.23 0.45 1.55
7 2.0 0.3 0.74 0.85 1.25
8 2.0 0.4 0.55 0.55 1.40
Here is code to generate this DataFrame with pandas:
import pandas as pd
data = [[0.5, 0.2, 0.25, 0.75, 1.25],
[0.5, 0.3, 0.12, 0.41, 1.40],
[0.5, 0.4, 0.85, 0.15, 1.55],
[1.0, 0.2, 0.11, 0.15, 1.25],
[1.0, 0.3, 0.10, 0.11, 1.40],
[1.0, 0.4, 0.87, 0.14, 1.25],
[2.0, 0.2, 0.23, 0.45, 1.55],
[2.0, 0.3, 0.74, 0.85, 1.25],
[2.0, 0.4, 0.55, 0.55, 1.40]]
df = pd.DataFrame(data,columns=['A','B','C','D','E'])
This data represents the outcome of an experiment where for each combination of A, B and E there is a unique value of C.
What I want is to perform a linear interpolation so that I get similar data for A = 0.7, for instance, based on the values for A = 0.5 and A = 1.0.
The expected output should be something like:
A B C D E
0 0.5 0.2 0.25 0.75 1.25
1 0.5 0.3 0.12 0.41 1.40
2 0.5 0.4 0.85 0.15 1.55
3 0.7 0.2 xxx xxx 1.25
4 0.7 0.3 xxx xxx 1.40
5 0.7 0.4 xxx xxx 1.55
6 1.0 0.2 0.11 0.15 1.25
7 1.0 0.3 0.10 0.11 1.40
8 1.0 0.4 0.87 0.14 1.25
9 2.0 0.2 0.23 0.45 1.55
10 2.0 0.3 0.74 0.85 1.25
11 2.0 0.4 0.55 0.55 1.40
Is there a straightforward way to do that in Python? I tried using pandas' interpolate, but the values I got didn't make sense.
Any suggestions?
Here is an example of how to create an interpolation function mapping values from column A to values from column C (arbitrarily picking 0.5 to 2.0 for values of A):
import pandas as pd
import numpy as np
from scipy import interpolate
# Set up the dataframe
data = [[0.5, 0.2, 0.25, 0.75, 1.25],
[0.5, 0.3, 0.12, 0.41, 1.40],
[0.5, 0.4, 0.85, 0.15, 1.55],
[1.0, 0.2, 0.11, 0.15, 1.25],
[1.0, 0.3, 0.10, 0.11, 1.40],
[1.0, 0.4, 0.87, 0.14, 1.25],
[2.0, 0.2, 0.23, 0.45, 1.55],
[2.0, 0.3, 0.74, 0.85, 1.25],
[2.0, 0.4, 0.55, 0.55, 1.40]]
df = pd.DataFrame(data,columns=['A','B','C','D','E'])
# Create the interpolation function
f = interpolate.interp1d(df['A'], df['C'])
# Evaluate new A (x) values to get new C (y) values via interpolation
xnew = np.linspace(0.5, 2.0, 10)
ynew = f(xnew)
print("%-7s %-7s"%("A","C"))
print("-"*16)
for x, y in zip(xnew, ynew):
    print("%0.4f\t%0.4f" % (x, y))
The result:
A C
----------------
0.5000 0.8500
0.6667 0.6033
0.8333 0.3567
1.0000 0.8700
1.1667 0.7633
1.3333 0.6567
1.5000 0.5500
1.6667 0.4433
1.8333 0.3367
2.0000 0.5500
Can pandas combine multiple lists of readings and return the maximum reading values for the elements in aoiFeatures?
Given:
# FYI: 2.4 million elements in each of these lists in reality
allFeatures = [101, 179, 181, 183, 185, 843, 845, 847, 849, 851]
allReadings1 = [0.27, 0.25, 0.13, 0.04, 0.05, 0.09, 0.15, 0.13, 0.12, 0.20]
allReadings2 = [0.25, 0.06, 0.29, 0.29, 0.04, 0.21, 0.07, 0.06, 0.07, 0.06]
allReadings3 = [0.12, 0.02, 0.20, 0.27, 0.04, 0.08, 0.11, 0.24, 0.00, 0.13]
allReadings4 = [0.21, 0.00, 0.22, 0.11, 0.24, 0.16, 0.11, 0.18, 0.27, 0.14]
allReadings5 = [0.02, 0.18, 0.26, 0.22, 0.23, 0.15, 0.24, 0.28, 0.00, 0.07]
allReadings6 = [0.08, 0.25, 0.21, 0.23, 0.14, 0.21, 0.18, 0.09, 0.17, 0.27]
allReadings7 = [0.20, 0.02, 0.28, 0.16, 0.18, 0.27, 0.29, 0.19, 0.29, 0.13]
allReadings8 = [0.17, 0.01, 0.07, 0.23, 0.14, 0.20, 0.19, 0.01, 0.15, 0.17]
allReadings9 = [0.12, 0.18, 0.09, 0.10, 0.00, 0.03, 0.11, 0.03, 0.14, 0.14]
allReadings10 =[0.13, 0.03, 0.20, 0.13, 0.30, 0.30, 0.28, 0.12, 0.19, 0.22]
# FYI: 67,000 elements in this list in reality
aoiFeatures = [181, 843, 849]
Result:
181 0.29
843 0.27
849 0.29
First zip all lists together with the DataFrame constructor and the index parameter, select rows by loc and get the max values:
L = list(zip(allReadings1,
allReadings2,
allReadings3,
allReadings4,
allReadings5,
allReadings6,
allReadings7,
allReadings8,
allReadings9,
allReadings10))
df = pd.DataFrame(L, index=allFeatures)
print (df)
0 1 2 3 4 5 6 7 8 9
101 0.27 0.25 0.12 0.21 0.02 0.08 0.20 0.17 0.12 0.13
179 0.25 0.06 0.02 0.00 0.18 0.25 0.02 0.01 0.18 0.03
181 0.13 0.29 0.20 0.22 0.26 0.21 0.28 0.07 0.09 0.20
183 0.04 0.29 0.27 0.11 0.22 0.23 0.16 0.23 0.10 0.13
185 0.05 0.04 0.04 0.24 0.23 0.14 0.18 0.14 0.00 0.30
843 0.09 0.21 0.08 0.16 0.15 0.21 0.27 0.20 0.03 0.30
845 0.15 0.07 0.11 0.11 0.24 0.18 0.29 0.19 0.11 0.28
847 0.13 0.06 0.24 0.18 0.28 0.09 0.19 0.01 0.03 0.12
849 0.12 0.07 0.00 0.27 0.00 0.17 0.29 0.15 0.14 0.19
851 0.20 0.06 0.13 0.14 0.07 0.27 0.13 0.17 0.14 0.22
aoiFeatures = [181, 843, 849]
s = df.loc[aoiFeatures].max(axis=1)
print (s)
181 0.29
843 0.30
849 0.29
dtype: float64
Option 1
You can let Python's max do the work and use pandas.Series to hold the results
readings = [allReadings1, allReadings2, allReadings3, allReadings4, allReadings5,
allReadings6, allReadings7, allReadings8, allReadings9, allReadings10]
s = pd.Series(dict(zip(allFeatures, map(max, zip(*readings)))))
s[aoiFeatures]
181 0.29
843 0.30
849 0.29
dtype: float64
Option 2
Or leverage NumPy:
readings = [allReadings1, allReadings2, allReadings3, allReadings4, allReadings5,
allReadings6, allReadings7, allReadings8, allReadings9, allReadings10]
s = pd.Series(np.max(readings, 0), allFeatures)
s[aoiFeatures]
181 0.29
843 0.30
849 0.29
dtype: float64
If you needed to update the array of maximums with a new reading
allReadings11 =[0.13, 0.03, 0.30, 0.13, 0.30, 0.30, 0.28, 0.12, 0.19, 0.22]
s[:] = np.maximum(s, allReadings11)
s[aoiFeatures]
181 0.29
843 0.30
849 0.29
dtype: float64
A very simple and quick approach:
pd.DataFrame([allReadings1, allReadings2,...],columns=allFeatures).max()
Sample output:
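For reference, spelling out all ten reading lists in place of the ellipsis, the column-wise maxima come out as:
101    0.27
179    0.25
181    0.29
183    0.29
185    0.30
843    0.30
845    0.29
847    0.28
849    0.29
851    0.27
dtype: float64
Indexing this result with aoiFeatures then gives just the features of interest (181, 843, 849).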