Interpolate NaN values over a DataFrame as a ring - python

I need to interpolate the NaN values in a DataFrame, but I want the interpolation to wrap around and use the first values of the DataFrame when the NaN value is the last one. Here is an example:
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict({"a": [1,2,3], "b":[1,2,np.nan]})
So the DataFrame is:
   a    b
0  1  1.0
1  2  2.0
2  3  NaN
But when I interpolate the nan values like:
df.interpolate(method="linear", inplace=True)
I get:
   a    b
0  1  1.0
1  2  2.0
2  3  2.0
The interpolation doesn't use the first value. My desired output would be to fill in the value 1.5, because of that circular interpolation.

One possible solution is to append the first row, interpolate, and then remove the last (appended) row:
df = df.append(df.iloc[0]).interpolate(method="linear").iloc[:-1]
print (df)
     a    b
0  1.0  1.0
1  2.0  2.0
2  3.0  1.5
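Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on newer versions the same idea can be sketched with pd.concat:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, np.nan]})
# append a copy of the first row, interpolate, then drop the extra row again
df = pd.concat([df, df.iloc[[0]]], ignore_index=True).interpolate(method="linear").iloc[:-1]
print(df)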
EDIT:
A more general solution: tile three copies of the frame so every NaN in the middle copy has non-missing neighbours on both sides, then keep only the middle copy:
df = pd.DataFrame.from_dict({"a": [1,2,3,4], "b":[np.nan,1,2,np.nan]})
df = pd.concat([df] * 3).interpolate(method="linear").iloc[len(df):-len(df)]
print (df)
   a         b
0  1  1.333333
1  2  1.000000
2  3  2.000000
3  4  1.666667
Or, if you need to work only with the last and first non-missing values (padding the frame with them before interpolating):
df = pd.DataFrame.from_dict({"a": [1,2,3,4], "b":[np.nan,1,2,np.nan]})
df1 = df.ffill().iloc[[-1]]
df2 = df.bfill().iloc[[0]]
df = pd.concat([df1, df, df2]).interpolate(method="linear").iloc[1:-1]
print (df)
   a    b
0  1  1.5
1  2  1.0
2  3  2.0
3  4  1.5
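Another sketch for a truly circular fill on a single column, not part of the answer above, uses numpy's np.interp with its period argument so the row positions are treated as points on a circle of length len(df):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, np.nan]})

x = np.arange(len(df))                     # positional x-coordinates
b = df["b"].to_numpy()
mask = ~np.isnan(b)
# period=len(df) makes the interpolation wrap around the end of the frame
df["b"] = np.interp(x, x[mask], b[mask], period=len(df))
print(df)   # the last value of b becomes 1.5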

Related

How to really filter a pandas dataset without leaving Nans everywhere

Say I have a huge DataFrame that only contains a handful of cells that match the filtering I perform. How can I end up with only the values that match (and their indexes and columns) in a new DataFrame, without the rest of the DataFrame turning into NaN? Dropping NaNs with dropna just removes the whole column or row, and filtering replaces non-matches with NaNs.
Here's my code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((1000, 1000)))
# this one is almost filled with Nans
df[df<0.01]
If you need the non-missing values in another format, you can use DataFrame.stack:
import numpy as np
import pandas as pd

np.random.seed(2020)
df = pd.DataFrame(np.random.randint(10, size=(5, 3)))
# mask out values >= 7, leaving NaN in their place
df1 = df[df < 7]
print (df1)
     0    1    2
0  0.0  NaN  3.0
1  6.0  3.0  3.0
2  NaN  NaN  0.0
3  0.0  NaN  NaN
4  3.0  NaN  2.0
df2 = df1.stack().rename_axis(('a','b')).reset_index(name='c')
print (df2)
   a  b    c
0  0  0  0.0
1  0  2  3.0
2  1  0  6.0
3  1  1  3.0
4  1  2  3.0
5  2  2  0.0
6  3  0  0.0
7  4  0  3.0
8  4  2  2.0
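Applied to the 1000x1000 frame from the question (a sketch reusing the question's 0.01 threshold), the same idea keeps only the matching cells as (row, column, value) triples:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((1000, 1000)))
# stack drops the NaN cells, leaving only values below the threshold
matches = (df[df < 0.01]
           .stack()
           .rename_axis(('row', 'col'))
           .reset_index(name='value'))
print(matches.head())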

Transform a big dataframe with many None values into a smaller one with an indication of the non-null columns

I have a big dataframe with 4 columns that often has 3 null values in every row. Sometimes there are 2, 1, or even 0 null values, but often 3.
I want to transform it into a two-column dataframe with, in each row, the non-null value and the name of the column from which it was extracted.
Example: How to transform this dataframe
df
Out[1]:
     a    b    c    d
0  1.0  NaN  NaN  NaN
1  NaN  2.0  NaN  NaN
2  NaN  NaN  3.0  2.0
3  NaN  NaN  1.0  NaN
to this One:
resultDF
Out[2]:
   value columnName
0      1          a
1      2          b
2      3          c
3      2          d
4      1          c
The goal is to do it without looping on rows. Is this possible?
You can use pd.melt to reshape the dataframe:
import pandas as pd
# read the data (test.csv is assumed to hold the example frame from the question)
df = pd.read_csv('test.csv')
df = df.melt(value_vars=['a','b','c','d'], var_name='foo', value_name='foo_value')
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)
print(df)
Output :
  foo  foo_value
0   a        1.0
1   b        2.0
2   c        3.0
3   c        1.0
4   d        2.0
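If you also want to keep the original row index alongside the column name, a stack-based sketch (my own variant, building the example frame by hand) gives the same pairs:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, np.nan, np.nan],
                   'b': [np.nan, 2.0, np.nan, np.nan],
                   'c': [np.nan, np.nan, 3.0, 1.0],
                   'd': [np.nan, np.nan, 2.0, np.nan]})
# stack drops NaN and keeps (row, column) in the index
result = (df.stack()
            .rename_axis(['row', 'columnName'])
            .reset_index(name='value'))
print(result)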

Python column difference

I need to create a column that computes the difference between another column's elements:
Column A    Computed Column
10          blank            # nothing to compute for first record
9           1                # = 10-9
7           2                # = 9-7
4           3                # = 7-4
I am assuming this needs a lambda function, but I am not sure how to reference the elements in 'Column A'.
Any help/direction you can provide would be great - thanks!
You can do it by shifting the column.
import pandas as pd
dict1 = {'A': [10,9,7,4]}
df = pd.DataFrame.from_dict(dict1)
df['Computed'] = df['A'].shift() - df['A']
print(df)
giving
    A  Computed
0  10       NaN
1   9       1.0
2   7       2.0
3   4       3.0
EDIT: OP extended the requirement to multiple columns
dict1 = {'A': [10,9,7,4], 'B': [10,9,7,4], 'C': [10,9,7,4]}
df = pd.DataFrame.from_dict(dict1)
columns_to_update = ['A', 'B']
for col in columns_to_update:
    df['Computed' + col] = df[col].shift() - df[col]
print(df)
By using the columns_to_update, you can choose the columns you want.
    A   B   C  ComputedA  ComputedB
0  10  10  10        NaN        NaN
1   9   9   9        1.0        1.0
2   7   7   7        2.0        2.0
3   4   4   4        3.0        3.0
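The loop can also be written as one vectorised expression over the selected columns (a sketch reusing the columns_to_update list from above):

import pandas as pd

df = pd.DataFrame({'A': [10, 9, 7, 4], 'B': [10, 9, 7, 4], 'C': [10, 9, 7, 4]})
columns_to_update = ['A', 'B']
# previous - current for every selected column at once, with renamed result columns
computed = (df[columns_to_update].shift() - df[columns_to_update]).add_prefix('Computed')
df = df.join(computed)
print(df)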
Use diff.
df = pd.DataFrame(data=[10,9,7,4], columns=['A'])
df['B'] = df.A.diff(-1).shift(1)
Output:
df
Out[140]:
    A    B
0  10  NaN
1   9  1.0
2   7  2.0
3   4  3.0
I would just do:
df = pd.DataFrame(data=[10,9,7,4], columns=['A'])
df['B'] = abs(df['A'].diff())
The reason for abs() is that diff() computes current - previous, whereas you want previous - current. diff() is already built in to the Series class, so using abs() will get you the correct result by taking the absolute value either way.
To support:
import pandas as pd
df = pd.DataFrame(data=[10,9,7,4], columns=['A'])
df['B'] = abs(df['A'].diff())
>>> df
# Output
    A    B
0  10  NaN
1   9  1.0
2   7  2.0
3   4  3.0
df2 = pd.DataFrame(data=[10,4,7,9], columns=['A'])
df2['B'] = abs(df2['A'].diff())
>>> df2
# Output
    A    B
0  10  NaN
1   4  6.0
2   7  3.0
3   9  2.0
To still outperform #cosmic_inquiry's solution:
import pandas as pd
df = pd.DataFrame(data=[10,9,7,4], columns=['A'])
df2 = pd.DataFrame(data=[10,4,7,9], columns=['A'])
df['B'] = df['A'].diff() * -1
df2['B'] = df2['A'].diff() * -1
>>> df
# Output:
    A    B
0  10  NaN
1   9  1.0
2   7  2.0
3   4  3.0
>>> df2
# Output:
    A    B
0  10  NaN
1   4  6.0
2   7 -3.0
3   9 -2.0
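As the df2 example shows, abs() and diff() * -1 only agree when 'Column A' is monotonically decreasing; if you really want previous - current, df['A'].diff() * -1 (or equivalently -df['A'].diff()) keeps the sign, while abs() does not.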

Delete values from pandas dataframe based on logical operation

I want to delete the values that are greater than a certain threshold from a pandas dataframe. Is there an efficient way to perform this? I am doing it with apply and lambda, which works fine but is a bit slow for a large dataframe, and I feel there must be a better method.
df = pd.DataFrame({'A': [1,2,3,4,5], 'B': [1,2,3,4,5]})
df
   A  B
0  1  1
1  2  2
2  3  3
3  4  4
4  5  5
How can this be done without apply and lambda?
df['A'] = df.apply(lambda x: x['A'] if x['A'] < 3 else None, axis=1)
df
     A  B
0  1.0  1
1  2.0  2
2  NaN  3
3  NaN  4
4  NaN  5
Use a boolean mask against the df:
In[21]:
df[df<3]
Out[21]:
     A
0  1.0
1  2.0
2  NaN
3  NaN
4  NaN
Here, where the boolean condition is not met, False is returned; this just masks out the df value, returning NaN.
If you actually want to drop these rows then self-assign:
df = df[df<3]
To compare a specific column:
In[22]:
df[df['A']<3]
Out[22]:
   A
0  1
1  2
If you want NaN in the removed rows, you can use a trick: double square brackets return a single-column df, so we can mask the df:
In[25]:
df[df[['A']]<3]
Out[25]:
     A
0  1.0
1  2.0
2  NaN
3  NaN
4  NaN
If you have multiple columns then the above won't work, as the boolean mask has to match the shape of the original df; in that case you can reindex against the original df's index:
In[31]:
df = pd.DataFrame({'A': [1,2,3,4,5], 'B': [1,2,3,4,5]})
df[df['A']<3].reindex(df.index)
Out[31]:
     A    B
0  1.0  1.0
1  2.0  2.0
2  NaN  NaN
3  NaN  NaN
4  NaN  NaN
EDIT
You've updated your question again; if you want to just overwrite the single column:
In[32]:
df = pd.DataFrame({'A': [1,2,3,4,5], 'B': [1,2,3,4,5]})
df['A'] = df.loc[df['A'] < 3,'A']
df
Out[32]:
     A  B
0  1.0  1
1  2.0  2
2  NaN  3
3  NaN  4
4  NaN  5
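Another way to express the same masking, not from the answer above but equivalent as far as I can tell, is Series.where for a single column or DataFrame.mask for the whole frame:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [1, 2, 3, 4, 5]})

# overwrite just column A: keep values where the condition holds, NaN elsewhere
df['A'] = df['A'].where(df['A'] < 3)
print(df)

# or mask every cell of the frame that fails the condition
df2 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [1, 2, 3, 4, 5]})
print(df2.mask(df2 >= 3))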

Pandas: filling missing values by weighted average in each group

I have a DataFrame where the 'value' column has missing values. I'd like to fill the missing values with the weighted average within each 'name' group. There was a post on how to fill the missing values with the simple average in each group, but not the weighted average. Thanks a lot!
df = pd.DataFrame({'value': [1, np.nan, 3, 2, 3, 1, 3, np.nan, np.nan],'weight':[3,1,1,2,1,2,2,1,1], 'name': ['A','A', 'A','B','B','B', 'C','C','C']})
  name  value  weight
0    A    1.0       3
1    A    NaN       1
2    A    3.0       1
3    B    2.0       2
4    B    3.0       1
5    B    1.0       2
6    C    3.0       2
7    C    NaN       1
8    C    NaN       1
I'd like to fill in the NaN with the weighted average within each "name" group, i.e.
  name  value  weight
0    A    1.0       3
1    A    1.5       1
2    A    3.0       1
3    B    2.0       2
4    B    3.0       1
5    B    1.0       2
6    C    3.0       2
7    C    3.0       1
8    C    3.0       1
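For group A, for example, the non-missing values are 1.0 (weight 3) and 3.0 (weight 1), so the fill value is (1*3 + 3*1) / (3 + 1) = 1.5, as shown above.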
You can group the data frame by name and use the fillna method to fill the missing values with the weighted average, which can be calculated with np.average and its weights parameter:
df['value'] = (df.groupby('name', group_keys=False)
                 .apply(lambda g: g.value.fillna(
                     np.average(g.dropna().value, weights=g.dropna().weight))))
df
#  name  value  weight
#0    A    1.0       3
#1    A    1.5       1
#2    A    3.0       1
#3    B    2.0       2
#4    B    3.0       1
#5    B    1.0       2
#6    C    3.0       2
#7    C    3.0       1
#8    C    3.0       1
To make this less convoluted, define a fillValue function:
import numpy as np
import pandas as pd
def fillValue(g):
    gNotNull = g.dropna()
    wtAvg = np.average(gNotNull.value, weights=gNotNull.weight)
    return g.value.fillna(wtAvg)
df['value'] = df.groupby('name', group_keys=False).apply(fillValue)
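An alternative sketch (my own, not part of the answer) computes each group's weighted average once and maps it back onto the missing rows:

import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [1, np.nan, 3, 2, 3, 1, 3, np.nan, np.nan],
                   'weight': [3, 1, 1, 2, 1, 2, 2, 1, 1],
                   'name': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']})

# one weighted average per group, computed from the non-missing rows only
group_avg = df.groupby('name').apply(
    lambda g: np.average(g['value'].dropna(),
                         weights=g.loc[g['value'].notna(), 'weight']))
df['value'] = df['value'].fillna(df['name'].map(group_avg))
print(df)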
