Pandas: fill NaNs based on group text values

The following is the pandas dataframe I have:
cluster Value
1 A
1 NaN
1 NaN
1 NaN
1 NaN
2 NaN
2 NaN
2 B
2 NaN
3 NaN
3 NaN
3 C
3 NaN
4 NaN
4 S
4 NaN
5 NaN
5 A
5 NaN
5 NaN
Looking at the data, cluster 1 has the Value 'A' in one row and NaN in all the others. I want to fill 'A' into every row of cluster 1, and do the same for every other cluster: take the one value present in the cluster and fill the remaining rows with it. The output should look like this:
cluster Value
1 A
1 A
1 A
1 A
1 A
2 B
2 B
2 B
2 B
3 C
3 C
3 C
3 C
4 S
4 S
4 S
5 A
5 A
5 A
5 A
I am new to Python and not sure how to proceed. Can anybody help with this?

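groupby + bfill, and ffill
# Select 'Value' explicitly: recent pandas versions drop the grouping column from groupby(...).bfill().
# bfill runs within each cluster; the trailing ffill is not grouped, so this assumes
# every cluster has at least one non-null value.
df['Value'] = df.groupby('cluster')['Value'].bfill().ffill()
df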
cluster Value
0 1 A
1 1 A
2 1 A
3 1 A
4 1 A
5 2 B
6 2 B
7 2 B
8 2 B
9 3 C
10 3 C
11 3 C
12 3 C
13 4 S
14 4 S
15 4 S
16 5 A
17 5 A
18 5 A
19 5 A
Or,
groupby + transform with first
df['Value'] = df.groupby('cluster').Value.transform('first')
df
cluster Value
0 1 A
1 1 A
2 1 A
3 1 A
4 1 A
5 2 B
6 2 B
7 2 B
8 2 B
9 3 C
10 3 C
11 3 C
12 3 C
13 4 S
14 4 S
15 4 S
16 5 A
17 5 A
18 5 A
19 5 A
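One caveat with the bfill/ffill chain is worth illustrating: only the first fill is grouped, so a cluster with no value at all can pick up a neighbour's value. The sketch below uses a small made-up frame (not the asker's data) in which cluster 2 is entirely NaN, and compares the chained form with a variant that keeps both fills inside each group via transform.

```python
import numpy as np
import pandas as pd

# Made-up example: cluster 2 has no non-null value at all.
df = pd.DataFrame({
    "cluster": [1, 1, 2, 2],
    "Value": ["A", np.nan, np.nan, np.nan],
})

# Grouped bfill followed by an ungrouped ffill: 'A' leaks into cluster 2.
leaky = df.groupby("cluster")["Value"].bfill().ffill()

# Both fills applied inside each group: cluster 2 stays NaN.
safe = df.groupby("cluster")["Value"].transform(lambda s: s.bfill().ffill())

print(pd.DataFrame({"cluster": df["cluster"], "leaky": leaky, "safe": safe}))
```

If every cluster is guaranteed to contain a value and the rows of each cluster are contiguous, both forms give the same result.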

Edit
The following seems better:
nan_map = df.dropna().set_index('cluster').to_dict()['Value']
df['Value'] = df['cluster'].map(nan_map)
print(df)
Original
I can't think of a better way to do this than iterate over all the rows, but one might exist. First I built your DataFrame:
import pandas as pd
import math
# Build your DataFrame (DataFrame.from_items was removed in pandas 1.0, so pass a dict)
df = pd.DataFrame({
    'cluster': [1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,5,5,5,5],
    'Value': [float('nan') for _ in range(20)],
})
df['Value'] = df['Value'].astype(object)
df.at[ 0,'Value'] = 'A'
df.at[ 7,'Value'] = 'B'
df.at[11,'Value'] = 'C'
df.at[14,'Value'] = 'S'
df.at[17,'Value'] = 'A'
Now here's an approach that first creates a nan_map dict, then sets the values in Value as specified in the dict.
# Create a dict to map clusters to unique values
nan_map = df.dropna().set_index('cluster').to_dict()['Value']
# nan_map: {1: 'A', 2: 'B', 3: 'C', 4: 'S', 5: 'A'}
# Apply
for i, row in df.iterrows():
    df.at[i,'Value'] = nan_map[row['cluster']]
print(df)
Output:
cluster Value
0 1 A
1 1 A
2 1 A
3 1 A
4 1 A
5 2 B
6 2 B
7 2 B
8 2 B
9 3 C
10 3 C
11 3 C
12 3 C
13 4 S
14 4 S
15 4 S
16 5 A
17 5 A
18 5 A
19 5 A
Note: This sets all values based on the cluster and doesn't check for NaN-ness. You may want to experiment with something like:
# Apply
for i, row in df.iterrows():
    if isinstance(df.at[i,'Value'], float) and math.isnan(df.at[i,'Value']):
        df.at[i,'Value'] = nan_map[row['cluster']]
to see which is more efficient (my guess is the former, without the checks).
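The NaN check can also be done without a row loop. The following is a minimal sketch on a small made-up frame, assuming the same nan_map idea as above: fillna only touches missing entries, so existing values are never overwritten, and a cluster without any known value simply stays NaN.

```python
import numpy as np
import pandas as pd

# Made-up example: cluster 3 has no known value.
df = pd.DataFrame({
    "cluster": [1, 1, 2, 2, 3],
    "Value": ["A", np.nan, np.nan, "B", np.nan],
})

# Map each cluster to its known value (if a cluster had several, the last one wins).
nan_map = df.dropna().set_index("cluster")["Value"].to_dict()

# Fill only the missing entries; non-null values are left untouched.
df["Value"] = df["Value"].fillna(df["cluster"].map(nan_map))
print(df)
```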

Related

map values in a dataframe according to ranges

I have a dataframe df
import pandas
df = pandas.DataFrame(data=[1,2,3,2,2,2,3,3,4,5,10,11,12,1,2,1,1], columns=['codes'])
codes
0 1
1 2
2 3
3 2
4 2
5 2
6 3
7 3
8 4
9 5
10 10
11 11
12 12
13 1
14 2
15 1
16 1
and I would like to group the values in the column codes according to a specific logic:
values == 0 become A
values in the range (1,4) become B
values == 5 become C
values in the range (6,16) become D
Is there a way to keep the logic and the dataframe separate so that it is easy to change the grouping rules in the future?
I would like to avoid writing
df.loc[df['codes']==0,'codes']='A'
df.loc[(df['codes']>=1) & (df['codes']<=4),'codes']='B'
The first idea is to use Series.map with merged dictionaries; the second is to use cut with right=False:
import pandas as pd
df = pd.DataFrame(data=[0,1,2,3,2,2,2,3,3,4,5,10,11,12,16,2,17,1], columns=['codes'])
d1 = {0: 'A', 5:'C'}
d2 = dict.fromkeys(range(1,5), 'B')
d3 = dict.fromkeys(range(6,17), 'D')
d = {**d1, **d2, **d3}
df['codes1'] = df['codes'].map(d)
df['codes2'] = pd.cut(df['codes'], bins=(0,1,5,6,17), labels=list('ABCD'), right=False)
print (df)
codes codes1 codes2
0 0 A A
1 1 B B
2 2 B B
3 3 B B
4 2 B B
5 2 B B
6 2 B B
7 3 B B
8 3 B B
9 4 B B
10 5 C C
11 10 D D
12 11 D D
13 12 D D
14 16 D D
15 2 B B
16 17 NaN NaN
17 1 B B
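To keep the rules fully separate from the dataframe code, the dictionary can be generated from a small rule table. This is only a sketch: RULES and build_lookup are hypothetical names introduced here, and the ranges are assumed to be inclusive integer ranges as in the question.

```python
import pandas as pd

# Hypothetical rule table: (low, high, label), both bounds inclusive.
# Changing the grouping later only means editing this list.
RULES = [
    (0, 0, "A"),
    (1, 4, "B"),
    (5, 5, "C"),
    (6, 16, "D"),
]

def build_lookup(rules):
    """Expand inclusive integer ranges into a flat {code: label} dict."""
    return {code: label
            for low, high, label in rules
            for code in range(low, high + 1)}

df = pd.DataFrame({"codes": [0, 1, 4, 5, 6, 16, 17]})
# Codes not covered by any rule (here 17) become NaN.
df["group"] = df["codes"].map(build_lookup(RULES))
print(df)
```

A flat dict works well for small integer ranges; for wide or non-integer ranges, the cut approach above is likely the better fit.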

Add Missing Values To Pandas Groups

Let's say I have a DataFrame like:
import pandas as pd
df = pd.DataFrame({"Quarter": [1,2,3,4,1,2,3,4,4],
"Type": ["a","a","a","a","b","b","c","c","d"],
"Value": [4,1,3,4,7,2,9,4,1]})
Quarter Type Value
0 1 a 4
1 2 a 1
2 3 a 3
3 4 a 4
4 1 b 7
5 2 b 2
6 3 c 9
7 4 c 4
8 4 d 1
For each Type, there needs to be a total of 4 rows that represent one of four quarters as indicated by the Quarter column. So, it would look like:
Quarter Type Value
0 1 a 4
1 2 a 1
2 3 a 3
3 4 a 4
4 1 b 7
5 2 b 2
6 3 b NaN
7 4 b NaN
8 1 c NaN
9 2 c NaN
10 3 c 9
11 4 c 4
12 1 d NaN
13 2 d NaN
14 3 d NaN
15 4 d 1
Then, where there are missing values in the Value column, fill the missing values using the next closest available value with the same Type (if it's the last quarter that is missing then fill with the third quarter):
Quarter Type Value
0 1 a 4
1 2 a 1
2 3 a 3
3 4 a 4
4 1 b 7
5 2 b 2
6 3 b 2
7 4 b 2
8 1 c 9
9 2 c 9
10 3 c 9
11 4 c 4
12 1 d 1
13 2 d 1
14 3 d 1
15 4 d 1
What's the best way to accomplish this?
Use reindex:
idx = pd.MultiIndex.from_product([
    df['Type'].unique(),
    range(1,5)
], names=['Type', 'Quarter'])
df.set_index(['Type', 'Quarter']).reindex(idx) \
    .groupby('Type') \
    .transform(lambda v: v.ffill().bfill()) \
    .reset_index()
You can use set_index and unstack to create the missing rows (assuming each quarter is available in at least one type), then ffill and bfill across the columns, and finally stack and reset_index to go back to the original shape:
df = df.set_index(['Type', 'Quarter']).unstack()\
    .ffill(axis=1).bfill(axis=1)\
    .stack().reset_index()
print (df)
Type Quarter Value
0 a 1 4.0
1 a 2 1.0
2 a 3 3.0
3 a 4 4.0
4 b 1 7.0
5 b 2 2.0
6 b 3 2.0
7 b 4 2.0
8 c 1 9.0
9 c 2 9.0
10 c 3 9.0
11 c 4 4.0
12 d 1 1.0
13 d 2 1.0
14 d 3 1.0
15 d 4 1.0
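The reindex approach can be broken into two inspectable steps: first materialise the missing (Type, Quarter) rows, then fill inside each Type. The sketch below uses the question's data; the intermediate name full is introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Quarter": [1, 2, 3, 4, 1, 2, 3, 4, 4],
                   "Type": ["a", "a", "a", "a", "b", "b", "c", "c", "d"],
                   "Value": [4, 1, 3, 4, 7, 2, 9, 4, 1]})

# Step 1: every Type crossed with quarters 1..4; missing pairs appear as NaN rows.
idx = pd.MultiIndex.from_product([df["Type"].unique(), range(1, 5)],
                                 names=["Type", "Quarter"])
full = df.set_index(["Type", "Quarter"]).reindex(idx)
n_rows, n_missing = len(full), int(full["Value"].isna().sum())

# Step 2: forward fill, then backward fill, without crossing Type boundaries.
full["Value"] = (full.groupby(level="Type")["Value"]
                     .transform(lambda s: s.ffill().bfill()))
result = full.reset_index()
print(n_rows, n_missing)
print(result)
```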

Pandas how to output distinct values in column based on duplicate in another column

Here is an example:
import pandas as pd
df = pd.DataFrame({
'product':['1','1','1','2','2','2','3','3','3','4','4','4','5','5','5'],
'value':['a','a','a','a','a','b','a','b','a','b','b','b','a','a','a']
})
product value
0 1 a
1 1 a
2 1 a
3 2 a
4 2 a
5 2 b
6 3 a
7 3 b
8 3 a
9 4 b
10 4 b
11 4 b
12 5 a
13 5 a
14 5 a
I need to output:
1 a
4 b
5 a
because these are the only products whose 'value' entries are all the same.
Apologies for my English.
I think you need this
m=df.groupby('product')['value'].transform('nunique')
df.loc[m==1].drop_duplicates().reset_index(drop=True)
Output
product value
0 1 a
1 4 b
2 5 a
Details
df.groupby('product')['value'].transform('nunique') returns a series as below
0 1
1 1
2 1
3 2
4 2
5 2
6 2
7 2
8 2
9 1
10 1
11 1
12 1
13 1
14 1
where each number is the count of unique values in that row's group. We then use df.loc to keep only the rows where this count is 1, i.e. the groups with a single unique value.
Then we drop duplicates, since you only need each group and its unique value.
If I understand your question correctly, this simple code is for you:
distinct_prod_df = df.drop_duplicates(['product'])
and gives:
product value
0 1 a
3 2 a
6 3 a
9 4 b
12 5 a
You can try this:
# keep only the products whose rows all share one value, then collapse the repeats
df = df.groupby('product').filter(lambda x: x['value'].nunique() == 1).drop_duplicates()
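Another way to get the same table is to aggregate once per product and filter the summary. This is a sketch using the question's data; summary and result are names chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["1", "1", "1", "2", "2", "2", "3", "3", "3",
                "4", "4", "4", "5", "5", "5"],
    "value": ["a", "a", "a", "a", "a", "b", "a", "b", "a",
              "b", "b", "b", "a", "a", "a"],
})

# One row per product: how many distinct values it has, and the first of them.
summary = df.groupby("product")["value"].agg(["nunique", "first"])

# Keep products with exactly one distinct value and report that value.
result = (summary.loc[summary["nunique"] == 1, "first"]
                 .rename("value")
                 .reset_index())
print(result)
```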

How to replace values in pandas DataFrame respecting index alignment

I want to replace some missing values in a dataframe with some other values, keeping the index alignment.
For example, in the following dataframe
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':np.repeat(['a','b','c'],4), 'B':np.tile([1,2,3,4],3),'C':range(12),'D':range(12)})
df = df.iloc[:-1]
df.set_index(['A','B'], inplace=True)
df.loc['b'] = np.nan
df
C D
A B
a 1 0 0
2 1 1
3 2 2
4 3 3
b 1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
c 1 8 8
2 9 9
3 10 10
I would like to replace the missing values of 'b' rows matching them with the corresponding indices of 'c' rows.
The result should look like
C D
A B
a 1 0 0
2 1 1
3 2 2
4 3 3
b 1 8 8
2 9 9
3 10 10
4 NaN NaN
c 1 8 8
2 9 9
3 10 10
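You can use fillna with the relevant 'c' rows; fillna aligns them on the remaining index level B:
# .ix has been removed from pandas, so use .loc
>>> filled = df.loc['b'].fillna(df.loc['c'])
>>> filled
C D
B
1 8 8
2 9 9
3 10 10
4 NaN NaN
Assign the values back (df.loc['b'] returns a copy, so an inplace fillna on it would not reliably update df):
>>> df.loc['b'] = filled.to_numpy()
Result:
>>> df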
C D
A B
a 1 0 0
2 1 1
3 2 2
4 3 3
b 1 8 8
2 9 9
3 10 10
4 NaN NaN
c 1 8 8
2 9 9
3 10 10
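An alternative that avoids assigning back is to relabel the 'c' rows as 'b' and let fillna align on the full (A, B) index. This is a sketch under the assumption that the columns are floats (they are cast up front here so that setting NaN does not change the dtype); donor and result are names introduced for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": np.repeat(["a", "b", "c"], 4),
    "B": np.tile([1, 2, 3, 4], 3),
    "C": np.arange(12, dtype=float),
    "D": np.arange(12, dtype=float),
})
df = df.iloc[:-1].set_index(["A", "B"])
df.loc["b"] = np.nan

# Relabel the 'c' rows as 'b' so their index matches the rows to be filled.
donor = pd.concat({"b": df.loc["c"]}, names=["A"])

# fillna aligns on (A, B); ('b', 4) has no counterpart in 'c' and stays NaN.
result = df.fillna(donor)
print(result)
```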
