I want to fill NaN values in column A in this way: take the value of A from another row that has the same value in column B.
Example:
  A    B
NaN 'ra'
  9 'ra'
  5 'pa'
So the NaN in column A should become 9, because both rows have the same value 'ra' in column B.
Forward- and back-fill A within each group of B (using transform keeps the fill from leaking across groups):
df['A'] = df.groupby('B')['A'].transform(lambda s: s.ffill().bfill())
Output:
>>> df
     A   B
0  9.0  ra
1  9.0  ra
2  5.0  pa
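An equivalent single-pass alternative (a sketch; it assumes each B group has one consistent A value) fills every NaN from the group's first valid entry:

# fill NaNs in A from the first valid A within the same B group
df['A'] = df['A'].fillna(df.groupby('B')['A'].transform('first'))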
To keep only the positive values, we can clip a dataframe, or specific columns of a dataframe, using
df.clip(lower=0)
But that replaces all negative values with zero. Is it possible to keep only the non-negative values and replace all the others with NaN?
I looked in the pandas documentation, but clip has no option for filling with NaN.
Another way is to clip and then replace all zeros with NaN, but that would also convert those values which were actually zero.
Use DataFrame.mask with DataFrame.lt or DataFrame.le:
import pandas as pd

df = pd.DataFrame({'a': [2, -5], 'b': [-6, -8], 'c': [8, 0]})
print(df)
   a  b  c
0  2 -6  8
1 -5 -8  0
df1 = df.mask(df.lt(0))
print(df1)
     a   b  c
0  2.0 NaN  8
1  NaN NaN  0
df2 = df.mask(df.le(0))
print(df2)
     a   b    c
0  2.0 NaN  8.0
1  NaN NaN  NaN
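Equivalently, DataFrame.where keeps the values where the condition is True and replaces everything else with NaN, so the same two results can be written as:

df1 = df.where(df.ge(0))  # keep values >= 0, same as df.mask(df.lt(0))
df2 = df.where(df.gt(0))  # keep values > 0, same as df.mask(df.le(0))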
I have a big dataframe with 4 columns, often with 3 null values in every row. Sometimes there are 2, 1, or even 0 null values, but often 3.
I want to transform it into a two-column dataframe that has, in each row, the non-null value and the name of the column it was extracted from.
Example: How to transform this dataframe
df
Out[1]:
     a    b    c    d
0  1.0  NaN  NaN  NaN
1  NaN  2.0  NaN  NaN
2  NaN  NaN  3.0  2.0
3  NaN  NaN  1.0  NaN
into this one:
resultDF
Out[2]:
   value columnName
0      1          a
1      2          b
2      3          c
3      2          d
4      1          c
The goal is to do this without looping over the rows. Is that possible?
You can use pd.melt to reshape the dataframe:
import pandas as pd
# reading the csv
df = pd.read_csv('test.csv')
df = df.melt(value_vars=['a','b','c','d'], var_name='foo', value_name='foo_value')
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)
print(df)
Output:
  foo  foo_value
0   a        1.0
1   b        2.0
2   c        3.0
3   c        1.0
4   d        2.0
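If you want the exact column names and order asked for in the question, the same melt can be applied to the original wide frame like this (a sketch using the question's requested names):

resultDF = (df.melt(value_vars=['a', 'b', 'c', 'd'],
                    var_name='columnName', value_name='value')
              .dropna()
              .reset_index(drop=True)[['value', 'columnName']])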
I have a DataFrame with some NaN values in all columns (3 columns in total). I want to populate each NaN with the latest valid value from the other rows, using the fastest approach.
As an example, if column A is NaN and column B is '123', I want to find the latest value of column A where column B is '123' and populate the NaN with that value.
I know it's easy to do this with a loop, but I'm concerned about performance in a DataFrame with 25 million records.
Any thoughts would help.
This solution uses a for loop, but it only loops over the rows where A is NaN.
A = The column containing NaNs
B = The column to be referenced
import pandas as pd
import numpy as np
# Consider this dataframe
df = pd.DataFrame({'A': [1, 2, 3, 4, np.nan, 6, 7, 8, np.nan, 10],
                   'B': ['xxxx', 'b', 'xxxx', 'd', 'xxxx', 'f', 'yyyy', 'h', 'yyyy', 'j']})
      A     B
0   1.0  xxxx
1   2.0     b
2   3.0  xxxx
3   4.0     d
4   NaN  xxxx
5   6.0     f
6   7.0  yyyy
7   8.0     h
8   NaN  yyyy
9  10.0     j
for i in df.loc[np.isnan(df.A)].index:  # loop over the indexes where A is NaN
    # dict with B values as keys and A values as values, built from rows up to i;
    # duplicate B keys are overwritten, so each key holds the latest A value seen so far
    dictionary = df.iloc[:i + 1].dropna().set_index('B').to_dict()['A']
    df.iloc[i, 0] = dictionary[df.iloc[i, 1]]  # use the dict to fill in A
This is how the df looks after executing the code:
      A     B
0   1.0  xxxx
1   2.0     b
2   3.0  xxxx
3   4.0     d
4   3.0  xxxx
5   6.0     f
6   7.0  yyyy
7   8.0     h
8   7.0  yyyy
9  10.0     j
Notice how at index 4, A's value gets changed to 3.0 and not 1.0, because 3.0 is the latest A value for B == 'xxxx' at that point.
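For 25 million records, a vectorized groupwise forward-fill should be much faster than the loop. This is a sketch assuming "latest" means the most recent preceding non-NaN A within the same B group, which is the same rule the loop applies:

# fill each NaN in A with the most recent valid A seen for the same B value
df['A'] = df.groupby('B')['A'].ffill()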
I have a pd.DataFrame that looks like this:
key_value    a   b    c    d    e
value_01     1  10    x  NaN  NaN
value_01   NaN  12  NaN  NaN  NaN
value_01   NaN   7  NaN  NaN  NaN
value_02     7   4    y  NaN  NaN
value_02   NaN   5  NaN  NaN  NaN
value_02   NaN   6  NaN  NaN  NaN
value_03    19  15    z  NaN  NaN
So now, based on the key_value:
For columns 'a' and 'c', I want to forward-fill the last seen value of the same column within the same key_value.
For column 'd', I want to copy the value of column 'b' from row i - 1 into row i of column 'd'.
Lastly, for column 'e', I want to put the running sum of the previous cells of column 'b' into row i of column 'e'.
For every key_value, the columns 'a', 'b' and 'c' have some value in their first row, based on which the next values are copied over or, for the other columns, computed.
key_value   a   b  c    d    e
value_01    1  10  x  NaN  NaN
value_01    1  12  x   10   10
value_01    1   7  x   12   22
value_02    7   4  y  NaN  NaN
value_02    7   5  y    4    4
value_02    7   6  y    5    9
value_03   19  15  z  NaN  NaN
My current approach:
size = df.key_value.size
for i in range(size):
if pd.isna(df.a[i]) and df.key_value[i] == output.key_value[i - 1]:
df.a[i] = df.a[i - 1]
df.c[i] = df.c[i - 1]
df.d[i] = df.b[i - 1]
df.e[i] = df.e[i] + df.b[i - 1]
For columns like 'a' and 'c', the NaN values are all at the same row indexes.
My approach works, but it takes very long since my dataframe has over 50000 records. I was wondering if there is a different way to do this, since I have multiple columns like 'a' and 'c' where values need to be copied over based on 'key_value', and some columns where the values are computed from a column like 'b'.
pd.concat with groupby and assign (demonstrated on simplified sample data where b runs from 1 to 7)
pd.concat([
    g.ffill().assign(d=lambda d: d.b.shift(), e=lambda d: d.d.cumsum())
    for _, g in df.groupby('key_value')
])
  key_value     a  b  c    d    e
0  value_01   1.0  1  x  NaN  NaN
1  value_01   1.0  2  x  1.0  1.0
2  value_01   1.0  3  x  2.0  3.0
3  value_02   7.0  4  y  NaN  NaN
4  value_02   7.0  5  y  4.0  4.0
5  value_02   7.0  6  y  5.0  9.0
6  value_03  19.0  7  z  NaN  NaN
groupby and apply
def h(g):
    return g.ffill().assign(
        d=lambda d: d.b.shift(), e=lambda d: d.d.cumsum())

df.groupby('key_value', as_index=False, group_keys=False).apply(h)
You can use groupby + ffill for the groupwise filling. The other operations require a groupwise shift and cumsum so that values do not leak across key_value groups.
In general, note that many common operations have been implemented efficiently in Pandas.
g = df.groupby('key_value')

df['a'] = g['a'].ffill()
df['c'] = g['c'].ffill()
df['d'] = g['b'].shift()
df['e'] = df.groupby('key_value')['d'].cumsum()
print(df)
  key_value     a  b  c    d    e
0  value_01   1.0  1  x  NaN  NaN
1  value_01   1.0  2  x  1.0  1.0
2  value_01   1.0  3  x  2.0  3.0
3  value_02   7.0  4  y  NaN  NaN
4  value_02   7.0  5  y  4.0  4.0
5  value_02   7.0  6  y  5.0  9.0
6  value_03  19.0  7  z  NaN  NaN
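Since the question mentions several columns that need the same groupwise fill, note that ffill can be applied to a list of columns in one call (a sketch):

# forward-fill several columns at once within each key_value group
cols = ['a', 'c']
df[cols] = df.groupby('key_value')[cols].ffill()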
Suppose I have a DataFrame:
import pandas as pd
import numpy as np

df = pd.DataFrame({'CATEGORY': ['a', 'b', 'c', 'b', 'b', 'a', 'b'],
                   'VALUE': [np.nan, 1, 0, 0, 5, 0, 4]})
which looks like:
  CATEGORY  VALUE
0        a    NaN
1        b    1.0
2        c    0.0
3        b    0.0
4        b    5.0
5        a    0.0
6        b    4.0
I group it:
df = df.groupby(by='CATEGORY')
Now let me show what I want, using group 'b' as an example:
df.get_group('b')
group b:
  CATEGORY  VALUE
1        b    1.0
3        b    0.0
4        b    5.0
6        b    4.0
I need to compute diff() between VALUE values within the scope of each group, skipping all NaNs and 0s. So the result should be:
  CATEGORY  VALUE  DIFF
1        b      1     -
3        b      0     -
4        b      5     4
6        b      4    -1
You can use diff to subtract values after dropping 0 and NaN values:
import numpy as np
import pandas as pd

df = pd.DataFrame({'CATEGORY': ['a', 'b', 'c', 'b', 'b', 'a', 'b'],
                   'VALUE': [np.nan, 1, 0, 0, 5, 0, 4]})
grouped = df.groupby("CATEGORY")
# define the diff func: replace 0s with NaN, drop missing values, then diff
diff = lambda x: x["VALUE"].replace(0, np.nan).dropna().diff()
df["DIFF"] = grouped.apply(diff).reset_index(0, drop=True)
print(df)
  CATEGORY  VALUE  DIFF
0        a    NaN   NaN
1        b    1.0   NaN
2        c    0.0   NaN
3        b    0.0   NaN
4        b    5.0   4.0
5        a    0.0   NaN
6        b    4.0  -1.0
Sounds like a job for a pd.Series.shift() operation along with a notnull mask.
First we remove the unwanted values before grouping the data:
# .copy() avoids SettingWithCopyWarning when adding columns below
nonull_df = df[(df['VALUE'] != 0) & df['VALUE'].notnull()].copy()
groups = nonull_df.groupby(by='CATEGORY')
Now we can shift within each group and calculate the diff:
nonull_df['prev_value'] = groups['VALUE'].shift(1)
nonull_df['diff'] = nonull_df['VALUE'] - nonull_df['prev_value']
Lastly, and optionally, you can copy the new columns back to the original dataframe:
df = df.join(nonull_df[['prev_value', 'diff']])
df
  CATEGORY  VALUE  prev_value  diff
0        a    NaN         NaN   NaN
1        b    1.0         NaN   NaN
2        c    0.0         NaN   NaN
3        b    0.0         NaN   NaN
4        b    5.0         1.0   4.0
5        a    0.0         NaN   NaN
6        b    4.0         5.0  -1.0
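A more direct variant of the same idea (a sketch): filter once, take the groupwise diff, and let index alignment fill NaN everywhere else:

# diff within each CATEGORY after excluding 0s and NaNs;
# assigning back aligns on the index, leaving NaN for the excluded rows
nonull = df.loc[(df['VALUE'] != 0) & df['VALUE'].notnull()]
df['DIFF'] = nonull.groupby('CATEGORY')['VALUE'].diff()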