I have a df:
1 2 3 4 5 6 7 8 9 10
A 10 0 0 15 0 21 45 0 0 7
I am trying to fill the values of row A, carrying the current value forward whenever the next value is 0, so that the df would look like this:
1 2 3 4 5 6 7 8 9 10
A 10 10 10 15 15 21 45 45 45 7
I tried:
df.loc[['A']].replace(to_replace=0, method='ffill').values
But this does not work. Where is my mistake?
If you want to use your method, you need to work with a Series on both sides:
df.loc['A'] = df.loc['A'].replace(to_replace=0, method='ffill')
Alternatively, you can mask the 0s with NaN and ffill the data along axis=1:
df.mask(df.eq(0)).ffill(axis=1)
output:
1 2 3 4 5 6 7 8 9 10
A 10.0 10.0 10.0 15.0 15.0 21.0 45.0 45.0 45.0 7.0
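For completeness, a self-contained sketch of the mask-and-ffill approach (the frame is rebuilt here to match the example):

```python
import pandas as pd

# Rebuild the example: one row 'A', columns 1-10
df = pd.DataFrame([[10, 0, 0, 15, 0, 21, 45, 0, 0, 7]],
                  index=['A'], columns=range(1, 11))

# Turn every 0 into NaN, then forward-fill along the row (axis=1)
result = df.mask(df.eq(0)).ffill(axis=1)
print(result)
```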
You should change your code a little bit and work with a Series:
import pandas as pd
df = pd.DataFrame({'1': [10], '2': [0], '3': [0], '4': [15], '5': [0],
'6': [21], '7': [45], '8': [0], '9': [0], '10': [7]},
index=['A'])
print(df.apply(lambda x: pd.Series(x.values).replace(to_replace=0, method='ffill').values, axis=1))
Output:
A [10, 10, 10, 15, 15, 21, 45, 45, 45, 7]
dtype: object
This way, if you have multiple indices, the code still works:
import pandas as pd
df = pd.DataFrame({'1': [10, 11], '2': [0, 12], '3': [0, 0], '4': [15, 0], '5': [0, 3],
'6': [21, 3], '7': [45, 0], '8': [0, 4], '9': [0, 5], '10': [7, 0]},
index=['A', 'B'])
print(df.apply(lambda x: pd.Series(x.values).replace(to_replace=0, method='ffill').values, axis=1))
Output:
A [10, 10, 10, 15, 15, 21, 45, 45, 45, 7]
B [11, 12, 12, 12, 3, 3, 3, 4, 5, 5]
dtype: object
Another option is to map 0 to NA element-wise, then forward-fill along the rows:
df.applymap(lambda x: pd.NA if x == 0 else x).fillna(method='ffill', axis=1)
1 2 3 4 5 6 7 8 9 10
A 10 10 10 15 15 21 45 45 45 7
I would like to subtract two data frames by indexes:
# importing pandas as pd
import pandas as pd
# Creating the first dataframe
df1 = pd.DataFrame({"Type":['T1', 'T2', 'T3', 'T4', 'T5'],
"A":[10, 11, 7, 8, 5],
"B":[21, 5, 32, 4, 6],
"C":[11, 21, 23, 7, 9],
"D":[1, 5, 3, 8, 6]},
index =["2001", "2002", "2003", "2004", "2005"])
df1
# Creating the second dataframe
df2 = pd.DataFrame({"A":[1, 2, 2, 2],
"B":[3, 2, 4, 3],
"C":[2, 2, 7, 3],
"D":[1, 3, 2, 1]},
index =["2000", "2002", "2003", "2004"])
df2
# Desired
df = pd.DataFrame({"Type":['T1', 'T2', 'T3', 'T4', 'T5'],
"A":[10, 9, 5, 6, 5],
"B":[21, 3, 28, 1, 6],
"C":[11, 19, 16, 4, 9],
"D":[1, 2, 1, 7, 5]},
index =["2001", "2002", "2003", "2004", "2005"])
df
df1.subtract(df2)
However, in some cases it returns NaN; I would like to keep the values from df1 where there is nothing to subtract.
You could handle the NaNs using combine_first:
df1.subtract(df2).combine_first(df1).dropna(how='all')
output:
A B C D Type
2001 10.0 21.0 11.0 1.0 T1
2002 9.0 3.0 19.0 2.0 T2
2003 5.0 28.0 16.0 1.0 T3
2004 6.0 1.0 4.0 7.0 T4
2005 5.0 6.0 9.0 6.0 T5
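combine_first fills the NaNs of the left frame with values from the right one, which is why it restores both the Type column and the rows missing from df2. A minimal self-contained run (trimmed to two columns):

```python
import pandas as pd

df1 = pd.DataFrame({"Type": ['T1', 'T2', 'T3', 'T4', 'T5'],
                    "A": [10, 11, 7, 8, 5]},
                   index=["2001", "2002", "2003", "2004", "2005"])
df2 = pd.DataFrame({"A": [1, 2, 2, 2]},
                   index=["2000", "2002", "2003", "2004"])

# subtract leaves NaN wherever df2 has no matching index/column;
# combine_first patches those NaNs from df1, and dropna removes
# the all-NaN row created by df2's extra "2000" index
res = df1.subtract(df2).combine_first(df1).dropna(how='all')
print(res)
```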
You can use select_dtypes to choose the numeric columns, then subtract the reindexed data:
(df1.select_dtypes(include='number')
.sub(df2.reindex(df1.index, fill_value=0))
.join(df1.select_dtypes(exclude='number'))
)
Output:
A B C D Type
2001 10 21 11 1 T1
2002 9 3 19 2 T2
2003 5 28 16 1 T3
2004 6 1 4 7 T4
2005 5 6 9 6 T5
I have a dataset that looks like this:
state Item_Number
0 AP 1.0, 4.0, 20.0, 2.0, 11.0, 7.0
1 GOA 1.0, 4.0, nan, 2.0, 8.0, nan
2 GU 1.0, 4.0, 13.0, 2.0, 11.0, 7.0
3 KA 1.0, 23.0, nan, nan, 11.0, 7.0
4 MA 1.0, 14.0, 13.0, 2.0, 19.0, 21.0
I want to remove the NaN values, sort the values in each row, and convert the floats to int. Afterwards the dataset should look like this:
state Item_Number
0 AP 1, 2, 4, 7, 11, 20
1 GOA 1, 2, 4, 8
2 GU 1, 2, 4, 7, 11, 13
3 KA 1, 7, 11, 23
4 MA 1, 2, 13, 14, 19, 21
A solution using Series.str.split and Series.apply:
df['Item_Number'] = (df.Item_Number.str.split(',')
.apply(lambda x: ', '.join([str(z) for z in sorted([int(float(y)) for y in x if 'nan' not in y])])))
Output:
state Item_Number
0 AP 1, 2, 4, 7, 11, 20
1 GOA 1, 2, 4, 8
2 GU 1, 2, 4, 7, 11, 13
3 KA 1, 7, 11, 23
4 MA 1, 2, 13, 14, 19, 21
Use a list comprehension, removing missing values via the principle that NaN != NaN:
df['Item_Number'] = [sorted([int(float(y)) for y in x.split(',') if float(y) == float(y)]) for x in df['Item_Number']]
print (df)
state Item_Number
0 AP [1, 2, 4, 7, 11, 20]
1 GOA [1, 2, 4, 8]
2 GU [1, 2, 4, 7, 11, 13]
3 KA [1, 7, 11, 23]
4 MA [1, 2, 13, 14, 19, 21]
If need strings:
df['Item_Number'] = [' '.join(map(str, sorted([int(float(y)) for y in x.split(',') if float(y) == float(y)]))) for x in df['Item_Number']]
print (df)
state Item_Number
0 AP 1 2 4 7 11 20
1 GOA 1 2 4 8
2 GU 1 2 4 7 11 13
3 KA 1 7 11 23
4 MA 1 2 13 14 19 21
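The float(y) == float(y) filter in these comprehensions relies on NaN being the only value that compares unequal to itself:

```python
nan = float('nan')
print(nan == nan)   # False: NaN never equals itself
print(4.0 == 4.0)   # True for any ordinary number

# keep only the tokens that parse to a non-NaN float
values = [v for v in '1.0, 4.0, nan, 2.0'.split(',') if float(v) == float(v)]
print([float(v) for v in values])
```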
I have a data frame with the temperatures recorded per day/month/year.
Then I find the lowest temperature for each month using groupby and min, which gives a Series with a MultiIndex.
How can I drop a value from a specific year and month? eg. year 2005 month 12?
# Find the lowest value per each month
[In] low = df.groupby([df['Date'].dt.year,df['Date'].dt.month])['Data_Value'].min()
[In] low
[Out]
Date Date
2005 1 -60
2 -114
3 -153
4 -13
5 -14
6 26
7 83
8 65
9 21
10 36
11 -36
12 -86
2006 1 -75
2 -53
3 -83
4 -30
5 36
6 17
7 85
8 82
9 66
10 40
11 -2
12 -32
2007 1 -63
2 -42
3 -21
4 -11
5 28
6 74
7 73
8 61
9 46
10 -33
11 -37
12 -97
[In] low.index
[Out] MultiIndex(levels=[[2005, 2006, 2007], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]],
labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]],
names=['Date', 'Date'])
This works.
# dummy data
import numpy as np
import pandas as pd

mux = pd.MultiIndex.from_arrays([
    (2017,)*12 + (2018,)*12,
    list(range(1, 13))*2
], names=['year', 'month'])
df = pd.DataFrame({'value': np.random.randint(1, 20, len(mux))}, index=mux)
Then just use drop.
df.drop((2017, 12), inplace=True)
>>> print(df)
value
year month
2017 1 18
2 13
3 14
4 1
5 8
6 19
7 19
8 8
9 11
10 5
11 7 <<<
2018 1 9
2 18
3 9
4 14
5 7
6 4
7 6
8 12
9 12
10 1
11 19
12 10
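Applied to a groupby result shaped like the question's low series (rebuilt here from a few dummy temperature readings), the same tuple-based drop removes one (year, month) entry:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2005-12-01', '2005-12-15', '2006-01-03']),
    'Data_Value': [-86, -50, -75],
})

# lowest value per (year, month), as in the question
low = df.groupby([df['Date'].dt.year, df['Date'].dt.month])['Data_Value'].min()

# drop the 2005-12 entry by its MultiIndex tuple
low = low.drop((2005, 12))
print(low)
```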
I have two data frames:
import pandas as pd
df = pd.DataFrame({'x': [10, 47, 58, 68, 75, 80],
'y': [10, 9, 8, 7, 6, 5]})
df2 = pd.DataFrame({'x': [45, 55, 66, 69, 79, 82], 'y': [10, 9, 8, 7, 6, 5]})
df
x y
10 10
47 9
58 8
68 7
75 6
80 5
df2
x y
45 10
55 9
66 8
69 7
79 6
82 5
I want to interpolate between corresponding rows and generate a new data frame with N samples per pair of x values. Assume N=3 for this example.
The desired output is
x y
10 10
27.5 10
45 10
...
75 6
77 6
79 6
80 5
81 5
82 5
How can I use my data frames to create the desired output?
If you don't mind using numpy, this solution will give you your desired output:
import pandas as pd
import numpy as np
N = 3
df = pd.DataFrame({'x': [10, 47, 58, 68, 75, 80],
'y': [10, 9, 8, 7, 6, 5]})
df2 = pd.DataFrame({'x': [45, 55, 66, 69, 79, 82], 'y': [10, 9, 8, 7, 6, 5]})
new_x = np.array([np.linspace(i, j, N) for i, j in zip(df['x'], df2['x'])]).flatten()
new_y = df['y'].loc[np.repeat(df.index.values, N)].to_numpy()  # repeat each y value N times
final_df = pd.DataFrame({'x': new_x, 'y': new_y})
print(final_df)
Output
x y
0 10.0 10
1 27.5 10
2 45.0 10
3 47.0 9
...
15 80.0 5
16 81.0 5
17 82.0 5
I have the following data set in pandas.
import numpy as np
import pandas as pd
events = ['event1', 'event2', 'event3', 'event4', 'event5', 'event6']
wells = [np.array([1, 2]), np.array([1, 3]), np.array([1]),
np.array([4, 5, 6]), np.array([4, 5, 6]), np.array([7, 8])]
traces_per_well = [np.array([24, 24]), np.array([24, 21]), np.array([18]),
np.array([24, 24, 24]), np.array([24, 21, 24]), np.array([18, 21])]
df = pd.DataFrame({"event_no": events, "well_array": wells,
"trace_per_well": traces_per_well})
df["total_traces"] = df['trace_per_well'].apply(np.sum)
df['supposed_traces_no'] = df['well_array'].apply(lambda x: len(x)*24)
df['pass'] = df['total_traces'] == df['supposed_traces_no']
print(df)
the output is printed below:
event_no well_array trace_per_well total_traces supposed_traces_no pass
0 event1 [1, 2] [24, 24] 48 48 True
1 event2 [1, 3] [24, 21] 45 48 False
2 event3 [1] [18] 18 24 False
3 event4 [4, 5, 6] [24, 24, 24] 72 72 True
4 event5 [4, 5, 6] [24, 21, 24] 69 72 False
5 event6 [7, 8] [18, 21] 39 48 False
I want to create two new columns: one holding the entries of trace_per_well that are not equal to 24, and another holding the corresponding elements of well_array.
The result should look like this.
event_no well_array trace_per_well total_traces supposed_traces_no pass wrong_trace_in_well wrong_well
0 event1 [1, 2] [24, 24] 48 48 True NaN NaN
1 event2 [1, 3] [24, 21] 45 48 False 21 3
2 event3 [1] [18] 18 24 False 18 1
3 event4 [4, 5, 6] [24, 24, 24] 72 72 True NaN NaN
4 event5 [4, 5, 6] [24, 21, 24] 69 72 False 21 5
5 event6 [7, 8] [18, 21] 39 48 False (18, 21) (7, 8)
Any help is greatly appreciated!
I would do this with a list comprehension. Generate your result in a single pass of the data and then assign to appropriate columns.
v = pd.Series(
[list(zip(*((x, y) for x, y in zip(X, Y) if x != 24)))
for X, Y in zip(df['trace_per_well'], df['well_array'])])
df['wrong_trace_in_well'] = v.str[0]
df['wrong_well'] = v.str[-1]
df[['wrong_trace_in_well', 'wrong_well']]
wrong_trace_in_well wrong_well
0 NaN NaN
1 (21,) (3,)
2 (18,) (1,)
3 NaN NaN
4 (21,) (5,)
5 (18, 21) (7, 8)
Alternatively, if you want to do this in multiple passes, then
df['wrong_trace_in_well'] = [[x for x in X if x != 24] for X in df['trace_per_well']]
df['wrong_well'] = [
[y for x, y in zip(X, Y) if x != 24]
for X, Y in zip(df['trace_per_well'], df['well_array'])]
df[['wrong_trace_in_well', 'wrong_well']]
wrong_trace_in_well wrong_well
0 [] []
1 [21] [3]
2 [18] [1]
3 [] []
4 [21] [5]
5 [18, 21] [7, 8]
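The .str[0] / .str[-1] indexing in the first approach works because the .str accessor also indexes into list- or tuple-valued entries, returning NaN where an entry is empty. A minimal illustration:

```python
import pandas as pd

s = pd.Series([[], [(21,), (3,)], [(18, 21), (7, 8)]])
first = s.str[0]    # first element of each list; NaN for the empty one
last = s.str[-1]    # last element of each list
print(first)
print(last)
```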