What is the best way to fill missing values in dataframe with items from list?
For example:
df = pd.DataFrame([[1,2,3],[4,5],[7,8],[10,11,12],[13,14]])
0 1 2
0 1 2 3
1 4 5 NaN
2 7 8 NaN
3 10 11 12
4 13 14 NaN
list = [6, 9, 150]
to get something like this:
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
4 13 14 150
This is actually a little tricky and a bit of a hack. If you know the column you want to fill the NaN values for, then you can construct a DataFrame for that column with the indices of the missing values and pass it to fillna:
In [33]:
fill = pd.DataFrame(index=df.index[df.isnull().any(axis=1)], data=[6, 9, 150], columns=[2])
df.fillna(fill)
Out[33]:
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
4 13 14 150
You can't pass a dict (my original answer), because fillna treats the dict keys as column labels and uses the scalar value for all NaN values in that column, which is not what you want:
In [40]:
l = [6, 9, 150]
df.fillna(dict(zip(df.index[df.isnull().any(axis=1)], l)))
Out[40]:
0 1 2
0 1 2 3
1 4 5 9
2 7 8 9
3 10 11 12
4 13 14 9
You can see that it replaced all the NaNs with 9, because the dict key 2 (the row index of a missing value) was matched against column 2.
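As an alternative (my sketch, not part of the original answer), assuming as above that all the NaNs sit in column 2, you can assign the list directly through a boolean mask; the list length just has to match the number of missing rows:
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5], [7, 8], [10, 11, 12], [13, 14]])
# rows where column 2 is NaN receive the fill values in order
df.loc[df[2].isnull(), 2] = [6, 9, 150]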
I want to process a DataFrame N rows at a time. If my data has 15 rows indexed from 0 to 14, I want to process rows 0 to 3, 4 to 7, 8 to 11, and 12 to 14.
For example, let's say that for each 4 rows I want sum(A) and mean(B):
Index  A  B
0      4  4
1      7  9
2      9  3
3      0  4
4      7  9
5      9  2
6      3  0
7      7  4
8      7  2
9      1  6
The resulting DataFrame should be:
Index  A   B
0      20  5
1      26  3.75
2      8   4
TL;DR: how can I make DataFrame.apply take multiple rows instead of a single row at a time?
Use GroupBy.agg with integer division of the index by 4:
# default RangeIndex
df = df.groupby(df.index // 4).agg({'A':'sum', 'B':'mean'})

# any index
df = df.groupby(np.arange(len(df.index)) // 4).agg({'A':'sum', 'B':'mean'})
print (df)
A B
0 20 5.00
1 26 3.75
2 8 4.00
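Since the TL;DR asks about DataFrame.apply specifically, here is a minimal sketch of the same computation with apply (my sketch, assuming the default RangeIndex; note that returning a Series from apply upcasts A to float):
import pandas as pd

df = pd.DataFrame({'A': [4, 7, 9, 0, 7, 9, 3, 7, 7, 1],
                   'B': [4, 9, 3, 4, 9, 2, 0, 4, 2, 6]})

# each 4-row chunk arrives in the lambda as its own DataFrame
out = df.groupby(df.index // 4).apply(
    lambda g: pd.Series({'A': g['A'].sum(), 'B': g['B'].mean()}))
print(out)
#       A     B
# 0  20.0  5.00
# 1  26.0  3.75
# 2   8.0  4.00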
I have a dataframe like below:
A B C
1 8 23
2 8 22
3 9 45
4 9 45
5 6 12
6 4 10
7 11 12
I want to drop consecutive duplicates in column B, keeping the first row, but only if the corresponding value in C is also the same.
E.g. here the value 9 in column B occurs consecutively and its corresponding occurrences in column C are also repeated (45). In this case I want to retain only the first occurrence.
Expected Output:
A B C
1 8 23
2 8 22
3 9 45
5 6 12
6 4 10
7 11 12
I tried a groupby, but did not know how to drop.
code:
df['consecutive'] = (df['B'] != df['B'].shift(1)).cumsum()
test = df.groupby('consecutive', as_index=False).apply(
    lambda x: (x['B'].head(1), x.shape[0], x['C'].iloc[-1] - x['C'].iloc[0]))
This groupby returns a Series, but I want to drop rows.
Add DataFrame.drop_duplicates by 2 columns:
df['consecutive'] = (df['B'] != df['B'].shift(1)).cumsum()
df = df.drop_duplicates(['consecutive','C'])
print (df)
A B C consecutive
0 1 8 23 1
1 2 8 22 1
2 3 9 45 2
4 5 6 12 3
5 6 4 10 4
6 7 11 12 5
Or chain both conditions with | for bitwise OR:
df = df[(df['B'] != df['B'].shift()) | (df['C'] != df['C'].shift())]
print (df)
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
An easy way is to check the row-to-row difference of B and C, then drop a row if both differences are 0 (duplicate values). The code is:
df[ ~((df.B.diff()==0) & (df.C.diff()==0)) ]
A oneliner to filter out such records is:
df[(df[['B', 'C']].shift() != df[['B', 'C']]).any(axis=1)]
Here we check whether the ['B', 'C'] columns are the same as in the shifted rows; if they are not, we retain the row:
>>> df[(df[['B', 'C']].shift() != df[['B', 'C']]).any(axis=1)]
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
This is quite scalable, since we can define a function that easily operates on an arbitrary number of columns:
def drop_consecutive_duplicates(df, *colnames):
    dff = df[list(colnames)]
    return df[(dff.shift() != dff).any(axis=1)]
So you can then filter with:
drop_consecutive_duplicates(df, 'B', 'C')
Using diff, ne and any over axis=1:
Note: this method only works for numeric columns
m = df[['B', 'C']].diff().ne(0).any(axis=1)
print(df[m])
Output
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
Details
df[['B', 'C']].diff()
B C
0 NaN NaN
1 0.0 -1.0
2 1.0 23.0
3 0.0 0.0
4 -3.0 -33.0
5 -2.0 -2.0
6 7.0 2.0
Then we check if any of the values in a row are not equal (ne) to 0:
df[['B', 'C']].diff().ne(0).any(axis=1)
0 True
1 True
2 True
3 False
4 True
5 True
6 True
dtype: bool
You can compute a boolean series marking the rows to drop, and then drop them:
to_drop = (df['B'] == df['B'].shift()) & (df['C'] == df['C'].shift())
df = df[~to_drop]
It gives the expected result:
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
Code
df1 = df.drop_duplicates(subset=['B', 'C'])
Result
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
If I understand your question correctly, given the following dataframe:
df = pd.DataFrame({'B': [8, 8, 9, 9, 6, 4, 11], 'C': [22, 23, 45, 45, 12, 10, 12],})
This one-liner solves your problem using the drop_duplicates method:
df.drop_duplicates(['B', 'C'])
It gives the expected results:
B C
0 8 22
1 8 23
2 9 45
4 6 12
5 4 10
6 11 12
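One caveat for both drop_duplicates answers above (my note, with a small sketch): drop_duplicates removes non-consecutive duplicates as well, unlike the shift-based filters, so the approaches only agree when duplicate (B, C) pairs happen to be adjacent, as they are in this example:
import pandas as pd

df = pd.DataFrame({'B': [8, 9, 8], 'C': [22, 45, 22]})

# drop_duplicates drops the last row even though it is not adjacent to its duplicate
print(df.drop_duplicates(['B', 'C']))

# the shift-based filter keeps it, since only consecutive repeats are removed
print(df[(df[['B', 'C']].shift() != df[['B', 'C']]).any(axis=1)])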
Let's say I have data shaped as in this example:
idx = pd.MultiIndex.from_product([[1, 2, 3, 4, 5, 6], ['a', 'b', 'c']],
names=['numbers', 'letters'])
col = ['Value']
df = pd.DataFrame(list(range(18)), idx, col)
print(df.unstack())
The output will be
Value
letters a b c
numbers
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
5 12 13 14
6 15 16 17
letters and numbers are index levels and Value is the only column.
The question is: how can I replace the Value column with columns named after the values of the letters index level?
So I would like to get such output
numbers a b c
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
5 12 13 14
6 15 16 17
where a, b and c are columns and numbers is the only index.
Appreciate your help.
The problem is caused by using unstack on the DataFrame rather than on the pd.Series:
df.Value.unstack().rename_axis(None, axis=1)
Out[151]:
a b c
numbers
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
5 12 13 14
6 15 16 17
Wen-Ben's answer prevents you from running into a data frame with multiple column levels in the first place.
If you happen to be stuck with multi-level columns anyway, you can get rid of the extra level by using .droplevel():
df = df.unstack()
df.columns = df.columns.droplevel()
df
Out[7]:
letters a b c
numbers
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
5 12 13 14
6 15 16 17
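On a reasonably recent pandas (0.24+), a sketch of the same cleanup as a one-liner using DataFrame.droplevel, with rename_axis(None, axis=1) also removing the leftover 'letters' axis name (my code, rebuilding the df from the question):
import pandas as pd

idx = pd.MultiIndex.from_product([[1, 2, 3, 4, 5, 6], ['a', 'b', 'c']],
                                 names=['numbers', 'letters'])
df = pd.DataFrame(list(range(18)), idx, ['Value'])

# drop the outer 'Value' column level, then clear the columns' axis name
out = df.unstack().droplevel(0, axis=1).rename_axis(None, axis=1)
print(out)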
Consider the following dataframe:
index count signal
1 1 1
2 1 NAN
3 1 NAN
4 1 -1
5 1 NAN
6 2 NAN
7 2 -1
8 2 NAN
9 3 NAN
10 3 NAN
11 3 NAN
12 4 1
13 4 NAN
14 4 NAN
I need to ffill the NaNs in 'signal', and rows with different 'count' values should not affect each other, so that I get the following dataframe:
index count signal
1 1 1
2 1 1
3 1 1
4 1 -1
5 1 -1
6 2 NAN
7 2 -1
8 2 -1
9 3 NAN
10 3 NAN
11 3 NAN
12 4 1
13 4 1
14 4 1
Right now I iterate through each group in the groupby object, fill the NaN values, and then copy to a new data frame:
new_table = np.array([])
for key, group in df.groupby('count'):
    group['signal'] = group['signal'].fillna(method='ffill')
    group1 = group.copy()
    if new_table.shape[0] == 0:
        new_table = group1
    else:
        new_table = pd.concat([new_table, group1])
which kind of works, but is really slow considering that the data frame is large. I am wondering if there is any other way to do it, with or without groupby. Thanks!
EDITED:
Thanks to Alexander and jwilner for providing alternative methods. However both methods are very slow for my big dataframe which has 800,000 rows of data.
Use the apply method.
In [56]: df = pd.DataFrame({"count": [1] * 4 + [2] * 5 + [3] * 2 , "signal": [1] + [None] * 4 + [-1] + [None] * 5})
In [57]: df
Out[57]:
count signal
0 1 1
1 1 NaN
2 1 NaN
3 1 NaN
4 2 NaN
5 2 -1
6 2 NaN
7 2 NaN
8 2 NaN
9 3 NaN
10 3 NaN
[11 rows x 2 columns]
In [58]: def ffill_signal(df):
....: df["signal"] = df["signal"].ffill()
....: return df
....:
In [59]: df.groupby("count").apply(ffill_signal)
Out[59]:
count signal
0 1 1
1 1 1
2 1 1
3 1 1
4 2 NaN
5 2 -1
6 2 -1
7 2 -1
8 2 -1
9 3 NaN
10 3 NaN
[11 rows x 2 columns]
However, be aware that groupby reorders stuff. If the count column doesn't always stay the same or increase, but instead can have values repeated in it, groupby might be problematic. That is, given a count series like [1, 1, 2, 2, 1], groupby will group like so: [1, 1, 1], [2, 2], which could have possibly undesirable effects on your forward filling. If that were undesired, you'd have to create a new series to use with groupby that always stayed the same or increased according to changes in the count series -- probably using pd.Series.diff and pd.Series.cumsum
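A minimal sketch of that diff/cumsum grouper (my code, not from the answer above): build a key that increments whenever count changes, so equal but non-adjacent count values land in different groups:
import pandas as pd

df = pd.DataFrame({'count': [1, 1, 2, 2, 1],
                   'signal': [1, None, -1, None, None]})

# new group id whenever 'count' differs from the previous row: 1, 1, 2, 2, 3
group_id = df['count'].ne(df['count'].shift()).cumsum()
df['signal'] = df.groupby(group_id)['signal'].ffill()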
I know it's very late, but I found a solution that is much faster than those proposed, namely to collect the updated dataframes in a list and do the concatenation only at the end. To take your example:
new_table = []
for key, group in df.groupby('count'):
    group = group.copy()
    group['signal'] = group['signal'].fillna(method='ffill')
    new_table.append(group)
new_table = pd.concat(new_table).reset_index(drop=True)
An alternative solution is to create a pivot table, forward fill values, and then map them back into the original DataFrame.
df2 = df.pivot(columns='count', values='signal', index='index').ffill()
df['signal'] = [df2.at[i, c]
                for i, c in zip(df2.index, df['count'].tolist())]
>>> df
count index signal
0 1 1 1
1 1 2 1
2 1 3 1
3 1 4 -1
4 1 5 -1
5 2 6 NaN
6 2 7 -1
7 2 8 -1
8 3 9 NaN
9 3 10 NaN
10 3 11 NaN
11 4 12 1
12 4 13 1
13 4 14 1
With 800k rows of data, the efficiency of this approach depends on how many unique values are in 'count'.
Compared to my prior answer:
%%timeit
for c in df['count'].unique():
    df.loc[df['count'] == c, 'signal'] = df.loc[df['count'] == c, 'signal'].ffill()
100 loops, best of 3: 4.1 ms per loop
%%timeit
df2 = df.pivot(columns='count', values='signal', index='index').ffill()
df['signal'] = [df2.at[i, c] for i, c in zip(df2.index, df['count'].tolist())]
1000 loops, best of 3: 1.32 ms per loop
Lastly, you can simply use groupby, although it is slower than the previous method:
df.groupby('count').ffill()
Out[191]:
index signal
0 1 1
1 2 1
2 3 1
3 4 -1
4 5 -1
5 6 NaN
6 7 -1
7 8 -1
8 9 NaN
9 10 NaN
10 11 NaN
11 12 1
12 13 1
13 14 1
%%timeit
df.groupby('count').ffill()
100 loops, best of 3: 3.55 ms per loop
Assuming the data has been pre-sorted on df['index'], try using loc instead:
for c in df['count'].unique():
    df.loc[df['count'] == c, 'signal'] = df.loc[df['count'] == c, 'signal'].ffill()
>>> df
index count signal
0 1 1 1
1 2 1 1
2 3 1 1
3 4 1 -1
4 5 1 -1
5 6 2 NaN
6 7 2 -1
7 8 2 -1
8 9 3 NaN
9 10 3 NaN
10 11 3 NaN
11 12 4 1
12 13 4 1
13 14 4 1
I want to replace some missing values in a dataframe with some other values, keeping the index alignment.
For example, in the following dataframe
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':np.repeat(['a','b','c'],4), 'B':np.tile([1,2,3,4],3),'C':range(12),'D':range(12)})
df = df.iloc[:-1]
df.set_index(['A','B'], inplace=True)
df.loc['b'] = np.nan
df
C D
A B
a 1 0 0
2 1 1
3 2 2
4 3 3
b 1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
c 1 8 8
2 9 9
3 10 10
I would like to replace the missing values of the 'b' rows by matching them with the corresponding indices of the 'c' rows.
The result should look like
C D
A B
a 1 0 0
2 1 1
3 2 2
4 3 3
b 1 8 8
2 9 9
3 10 10
4 NaN NaN
c 1 8 8
2 9 9
3 10 10
You can use fillna with a values dictionary built with to_dict from the relevant c rows, like this:
# you can of course use .loc (note: .ix is gone in modern pandas; see the sketch at the end)
>>> df.ix['b'].fillna(value=df.ix['c'].to_dict(), inplace=True)
C D
B
1 8 8
2 9 9
3 10 10
4 NaN NaN
Result:
>>> df
C D
A B
a 1 0 0
2 1 1
3 2 2
4 3 3
b 1 8 8
2 9 9
3 10 10
4 NaN NaN
c 1 8 8
2 9 9
3 10 10
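For completeness: .ix was removed in pandas 1.0, and the inplace fill above relies on the chained selection being a view. A sketch of the same fill in modern pandas (my code, assigning back explicitly; fillna aligns the b and c rows on the B index level):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.repeat(['a', 'b', 'c'], 4),
                   'B': np.tile([1, 2, 3, 4], 3),
                   'C': range(12), 'D': range(12)})
df = df.iloc[:-1]
df.set_index(['A', 'B'], inplace=True)
df.loc['b'] = np.nan

# b rows 1-3 pick up c's values; b row 4 stays NaN because c has no row 4
df.loc['b', :] = df.loc['b'].fillna(df.loc['c']).values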