I want to replace some missing values in a dataframe with some other values, keeping the index alignment.
For example, given the following dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':np.repeat(['a','b','c'],4), 'B':np.tile([1,2,3,4],3),'C':range(12),'D':range(12)})
df = df.iloc[:-1]
df.set_index(['A','B'], inplace=True)
df.loc['b'] = np.nan
df
C D
A B
a 1 0 0
2 1 1
3 2 2
4 3 3
b 1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
c 1 8 8
2 9 9
3 10 10
I would like to fill the missing values in the 'b' rows with the values from the 'c' rows that share the same inner index B.
The result should look like this:
C D
A B
a 1 0 0
2 1 1
3 2 2
4 3 3
b 1 8 8
2 9 9
3 10 10
4 NaN NaN
c 1 8 8
2 9 9
3 10 10
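You can use fillna with the 'c' rows as the fill values; fillna aligns them with the 'b' rows on the inner index B. (.ix has been removed from pandas, so use .loc.)
>>> filled = df.loc['b'].fillna(df.loc['c'])
>>> filled
C D
B
1 8 8
2 9 9
3 10 10
4 NaN NaN
Then assign the filled values back to the 'b' rows. Result:
>>> df.loc['b'] = filled.values
>>> df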
C D
A B
a 1 0 0
2 1 1
3 2 2
4 3 3
b 1 8 8
2 9 9
3 10 10
4 NaN NaN
c 1 8 8
2 9 9
3 10 10
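The same alignment can also be sketched without touching a slice of the frame. This is an alternative under stated assumptions, not the approach above: it relabels the 'c' rows as 'b' with pd.concat and lets DataFrame.fillna align on the full (A, B) index. The names fill and result are illustrative, and the columns are cast to float up front so that assigning NaN does not change their dtype.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.repeat(['a', 'b', 'c'], 4),
                   'B': np.tile([1, 2, 3, 4], 3),
                   'C': range(12),
                   'D': range(12)})
df = df.iloc[:-1].set_index(['A', 'B']).astype(float)
df.loc['b'] = np.nan

# Relabel the 'c' rows as 'b' so they carry the index labels of the rows to fill.
fill = pd.concat({'b': df.loc['c']}, names=['A'])

# fillna with a DataFrame aligns on index and columns;
# ('b', 4) has no counterpart in 'c' and therefore stays NaN.
result = df.fillna(fill)
print(result)
```

Because nothing is modified in place, the original df stays available for comparison.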
I have two dataframes, df1 and df2, and I want to compare them.
import pandas as pd
df1 = pd.DataFrame({'Column1': ['f','c','b','d','e','g','h'],
'Column2': ['1','2','3','4','5','7','8']})
df2 = pd.DataFrame({'Column1': ['f','b','d','e','a','g'],
'Column2': ['1','3','4','5','6','7']})
To compare the dataframes, I use pandas merge. Here's my code:
df = pd.merge(df1,df2, how="outer", on="Column1")
And the result:
Column1 Column2_x Column2_y
0 f 1 1
1 c 2 NaN
2 b 3 3
3 d 4 4
4 e 5 5
5 g 7 7
6 h 8 NaN
7 a NaN 6
But this is not the result I want. How can I get the output below?
What I want:
Column1 Column2_x Column2_y
0 f 1 1
1 c 2 NaN
2 b 3 3
3 d 4 4
4 e 5 5
5 a NaN 6
6 g 7 7
7 h 8 NaN
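It looks like you want to preserve the row order of df1. Try:
import numpy as np
(df1.assign(enum=np.arange(len(df1)))
    .merge(df2, on='Column1', how='outer')
    .sort_values('enum')
    .drop(columns=['enum'])
)
Output (rows that exist only in df2, such as 'a', have no enum value and are sorted last):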
Column1 Column2_x Column2_y
0 f 1 1
1 c 2 NaN
2 b 3 3
3 d 4 4
4 e 5 5
5 g 7 7
6 h 8 NaN
7 a NaN 6
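Since the goal is to compare the two frames, it may also help to see which side each row came from. A minimal sketch using the indicator option of merge; the names merged, only_df1 and only_df2 are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({'Column1': ['f', 'c', 'b', 'd', 'e', 'g', 'h'],
                    'Column2': ['1', '2', '3', '4', '5', '7', '8']})
df2 = pd.DataFrame({'Column1': ['f', 'b', 'd', 'e', 'a', 'g'],
                    'Column2': ['1', '3', '4', '5', '6', '7']})

# indicator=True adds a '_merge' column holding 'left_only', 'right_only' or 'both'.
merged = df1.merge(df2, on='Column1', how='outer', indicator=True)

# Keys that appear in only one of the two frames.
only_df1 = merged.loc[merged['_merge'] == 'left_only', 'Column1'].tolist()
only_df2 = merged.loc[merged['_merge'] == 'right_only', 'Column1'].tolist()
print(only_df1, only_df2)
```

The row order of an outer merge can differ between pandas versions, so the sketch does not rely on it.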
The following is the pandas dataframe I have:
cluster Value
1 A
1 NaN
1 NaN
1 NaN
1 NaN
2 NaN
2 NaN
2 B
2 NaN
3 NaN
3 NaN
3 C
3 NaN
4 NaN
4 S
4 NaN
5 NaN
5 A
5 NaN
5 NaN
If we look at the data, cluster 1 has the Value 'A' in one row and NaN in all the others. I want to fill every row of cluster 1 with 'A', and likewise for the other clusters: each cluster has one known value, and I want to use it to fill the remaining rows of that cluster. The output should look like this:
cluster Value
1 A
1 A
1 A
1 A
1 A
2 B
2 B
2 B
2 B
3 C
3 C
3 C
3 C
4 S
4 S
4 S
5 A
5 A
5 A
5 A
I am new to Python and not sure how to proceed. Can anybody help?
groupby + bfill and ffill, both applied within each cluster
df['Value'] = df.groupby('cluster')['Value'].transform(lambda s: s.bfill().ffill())
df
cluster Value
0 1 A
1 1 A
2 1 A
3 1 A
4 1 A
5 2 B
6 2 B
7 2 B
8 2 B
9 3 C
10 3 C
11 3 C
12 3 C
13 4 S
14 4 S
15 4 S
16 5 A
17 5 A
18 5 A
19 5 A
Or,
groupby + transform with first
df['Value'] = df.groupby('cluster').Value.transform('first')
df
cluster Value
0 1 A
1 1 A
2 1 A
3 1 A
4 1 A
5 2 B
6 2 B
7 2 B
8 2 B
9 3 C
10 3 C
11 3 C
12 3 C
13 4 S
14 4 S
15 4 S
16 5 A
17 5 A
18 5 A
19 5 A
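To check the two variants against each other on the sample data, here is a self-contained sketch. It assumes the fills should never cross cluster boundaries, so both fill directions are applied inside the group; the names filled and first are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'cluster': [1] * 5 + [2] * 4 + [3] * 4 + [4] * 3 + [5] * 4,
    'Value': ['A', np.nan, np.nan, np.nan, np.nan,
              np.nan, np.nan, 'B', np.nan,
              np.nan, np.nan, 'C', np.nan,
              np.nan, 'S', np.nan,
              np.nan, 'A', np.nan, np.nan],
})

# Variant 1: back-fill, then forward-fill, both restricted to each cluster.
filled = df.groupby('cluster')['Value'].transform(lambda s: s.bfill().ffill())

# Variant 2: broadcast the first non-null value of each cluster to all its rows.
first = df.groupby('cluster')['Value'].transform('first')

print(pd.DataFrame({'cluster': df['cluster'], 'filled': filled, 'first': first}))
```

The two variants agree whenever a cluster has a single known value. If a cluster held several different values, the first variant would keep each of them in place, while the second would overwrite them all with the first one.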
Edit
The following seems better:
nan_map = df.dropna().set_index('cluster').to_dict()['Value']
df['Value'] = df['cluster'].map(nan_map)
print(df)
Original
I can't think of a better way to do this than iterate over all the rows, but one might exist. First I built your DataFrame:
import pandas as pd
import math
# Build your DataFrame
# DataFrame.from_items is gone from current pandas; a plain dict keeps the column order
df = pd.DataFrame({
    'cluster': [1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,5,5,5,5],
    'Value': [float('nan') for _ in range(20)],
})
df['Value'] = df['Value'].astype(object)
df.at[ 0,'Value'] = 'A'
df.at[ 7,'Value'] = 'B'
df.at[11,'Value'] = 'C'
df.at[14,'Value'] = 'S'
df.at[17,'Value'] = 'A'
Now here's an approach that first creates a nan_map dict, then sets the values in Value as specified in the dict.
# Create a dict to map clusters to unique values
nan_map = df.dropna().set_index('cluster').to_dict()['Value']
# nan_map: {1: 'A', 2: 'B', 3: 'C', 4: 'S', 5: 'A'}
# Apply
for i, row in df.iterrows():
    df.at[i,'Value'] = nan_map[row['cluster']]
print(df)
Output:
cluster Value
0 1 A
1 1 A
2 1 A
3 1 A
4 1 A
5 2 B
6 2 B
7 2 B
8 2 B
9 3 C
10 3 C
11 3 C
12 3 C
13 4 S
14 4 S
15 4 S
16 5 A
17 5 A
18 5 A
19 5 A
Note: This sets all values based on the cluster and doesn't check for NaN-ness. You may want to experiment with something like:
# Apply
for i, row in df.iterrows():
    if isinstance(df.at[i,'Value'], float) and math.isnan(df.at[i,'Value']):
        df.at[i,'Value'] = nan_map[row['cluster']]
to see which is more efficient (my guess is the former, without the checks).
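The per-row NaN check can also be expressed without a loop. A minimal sketch, assuming each cluster has exactly one known value as in the sample data; the smaller frame used here is illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'cluster': [1, 1, 2, 2, 2, 3, 3],
    'Value': ['A', np.nan, np.nan, 'B', np.nan, np.nan, 'C'],
})

# One known value per cluster, built the same way as nan_map above.
nan_map = df.dropna().set_index('cluster')['Value'].to_dict()

# Look up every row's cluster, but only use the result where Value is missing.
df['Value'] = df['Value'].fillna(df['cluster'].map(nan_map))
print(df)
```

Unlike the plain map, this leaves existing non-missing values untouched, which matters if a cluster ever contains more than one distinct value.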
Let's say I have a DataFrame like:
import pandas as pd
df = pd.DataFrame({"Quarter": [1,2,3,4,1,2,3,4,4],
"Type": ["a","a","a","a","b","b","c","c","d"],
"Value": [4,1,3,4,7,2,9,4,1]})
Quarter Type Value
0 1 a 4
1 2 a 1
2 3 a 3
3 4 a 4
4 1 b 7
5 2 b 2
6 3 c 9
7 4 c 4
8 4 d 1
Each Type needs four rows, one for each of the four quarters indicated by the Quarter column. So it would look like:
Quarter Type Value
0 1 a 4
1 2 a 1
2 3 a 3
3 4 a 4
4 1 b 7
5 2 b 2
6 3 b NaN
7 4 b NaN
8 1 c NaN
9 2 c NaN
10 3 c 9
11 4 c 4
12 1 d NaN
13 2 d NaN
14 3 d NaN
15 4 d 1
Then, where Value is missing, fill it with the nearest available value of the same Type (for example, if the last quarter is missing, fill it from the third quarter):
Quarter Type Value
0 1 a 4
1 2 a 1
2 3 a 3
3 4 a 4
4 1 b 7
5 2 b 2
6 3 b 2
7 4 b 2
8 1 c 9
9 2 c 9
10 3 c 9
11 4 c 4
12 1 d 1
13 2 d 1
14 3 d 1
15 4 d 1
What's the best way to accomplish this?
Use reindex:
idx = pd.MultiIndex.from_product([
df['Type'].unique(),
range(1,5)
], names=['Type', 'Quarter'])
df.set_index(['Type', 'Quarter']).reindex(idx) \
.groupby('Type') \
.transform(lambda v: v.ffill().bfill()) \
.reset_index()
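For reference, a self-contained run of the reindex approach on the sample data. This is a sketch of the same steps; the names full and result are illustrative, and the fill is written against the Value column explicitly.

```python
import pandas as pd

df = pd.DataFrame({"Quarter": [1, 2, 3, 4, 1, 2, 3, 4, 4],
                   "Type": ["a", "a", "a", "a", "b", "b", "c", "c", "d"],
                   "Value": [4, 1, 3, 4, 7, 2, 9, 4, 1]})

# Every (Type, Quarter) combination that should exist.
idx = pd.MultiIndex.from_product([df["Type"].unique(), range(1, 5)],
                                 names=["Type", "Quarter"])

# Reindexing inserts NaN rows for the combinations that were missing.
full = df.set_index(["Type", "Quarter"]).reindex(idx)

# Fill within each Type: forward first, then backward for leading gaps.
full["Value"] = full.groupby(level="Type")["Value"].transform(
    lambda v: v.ffill().bfill())

result = full.reset_index()
print(result)
```

Filling forward before backward means a gap between two known quarters takes the earlier value; swapping the order would take the later one.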
You can use set_index and unstack to create the missing rows (assuming each quarter appears in at least one Type), then ffill and bfill along the columns, and finally stack and reset_index to return to the original shape:
df = df.set_index(['Type', 'Quarter']).unstack()\
.ffill(axis=1).bfill(axis=1)\
.stack().reset_index()
print (df)
Type Quarter Value
0 a 1 4.0
1 a 2 1.0
2 a 3 3.0
3 a 4 4.0
4 b 1 7.0
5 b 2 2.0
6 b 3 2.0
7 b 4 2.0
8 c 1 9.0
9 c 2 9.0
10 c 3 9.0
11 c 4 4.0
12 d 1 1.0
13 d 2 1.0
14 d 3 1.0
15 d 4 1.0
I have a data frame (sample, not real):
df =
A B C D E F
0 3 4 NaN NaN NaN NaN
1 9 8 NaN NaN NaN NaN
2 5 9 4 7 NaN NaN
3 5 7 6 3 NaN NaN
4 2 6 4 3 NaN NaN
Now I want to fill the NaN values in each row with the preceding pair of values: take the nearest existing pair of numbers to the left and repeat it across the rest of the row. This should be applied to the whole dataset.
There are many answers about filling along columns, but in this case I need to fill along rows.
There are also answers about filling NaN based on another column, but my data has more than 2000 columns; this is only sample data.
Desired output is:
df =
A B C D E F
0 3 4 3 4 3 4
1 9 8 9 8 9 8
2 5 9 4 7 4 7
3 5 7 6 3 6 3
4 2 6 4 3 4 3
IIUC, a quick solution without reshaping the data:
df.iloc[:,::2] = df.iloc[:,::2].ffill(axis=1)
df.iloc[:,1::2] = df.iloc[:,1::2].ffill(axis=1)
df
Output:
A B C D E F
0 3 4 3 4 3 4
1 9 8 9 8 9 8
2 5 9 4 7 4 7
3 5 7 6 3 6 3
4 2 6 4 3 4 3
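The slicing idea extends to repeating groups wider than two columns. A minimal sketch, where width is a hypothetical parameter for the group size and out is an illustrative name; it assumes the number of columns is a multiple of width:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [3, 9, 5, 5, 2],
    'B': [4, 8, 9, 7, 6],
    'C': [np.nan, np.nan, 4, 6, 4],
    'D': [np.nan, np.nan, 7, 3, 3],
    'E': [np.nan] * 5,
    'F': [np.nan] * 5,
}, dtype=float)

width = 2  # hypothetical parameter: number of columns in each repeating group

out = df.copy()
for offset in range(width):
    # Columns offset, offset + width, ... hold the same position of every group,
    # so a row-wise forward fill copies the last known group to the right.
    cols = out.columns[offset::width]
    out[cols] = out[cols].ffill(axis=1)

print(out.astype(int))
```

With more than 2000 columns this still runs only width fill operations, each on a strided slice of the columns.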
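The idea is to reshape the DataFrame so that missing values can be forward- and back-filled: number the columns, split that numbering into a pair number (integer division by 2) and a position within the pair (modulo 2), use the two as a column MultiIndex, and stack: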
c = df.columns
a = np.arange(len(df.columns))
df.columns = [a // 2, a % 2]
# if some pairs may still be missing after filling, remove .astype(int)
df1 = df.stack().ffill(axis=1).bfill(axis=1).unstack().astype(int)
df1.columns = c
print (df1)
A B C D E F
0 3 4 3 4 3 4
1 9 8 9 8 9 8
2 5 9 4 7 4 7
3 5 7 6 3 6 3
4 2 6 4 3 4 3
Detail:
print (df.stack())
0 1 2
0 0 3 NaN NaN
1 4 NaN NaN
1 0 9 NaN NaN
1 8 NaN NaN
2 0 5 4.0 NaN
1 9 7.0 NaN
3 0 5 6.0 NaN
1 7 3.0 NaN
4 0 2 4.0 NaN
1 6 3.0 NaN