add nan if missing consecutive values - python

I have a dataframe like
df2 = pandas.DataFrame(data=[[1,4],[2,2],[2,1],[5,2],[5,3]],columns=['A','B'])
df2
Out[117]:
A B
0 1 4
1 2 2
2 2 1
3 5 2
4 5 3
and I would like to add rows with NaN in column B wherever consecutive values are missing in column A. The dataframe should become:
df2
Out[117]:
A B
0 1 4
1 2 2
2 2 1
3 3 NaN
4 4 NaN
5 5 2
6 5 3
Could you please help me?

You can construct a dataframe to append, concatenate, then sort:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[[1,4],[2,2],[2,1],[5,2],[5,3]], columns=['A','B'])
# construct a dataframe holding the values of A missing from the full range
arr = np.arange(df['A'].min(), df['A'].max() + 1)
arr = arr[~np.isin(arr, df['A'].values)]
df_append = pd.DataFrame({'A': arr})
# concatenate and sort
res = pd.concat([df, df_append]).sort_values('A')
print(res)
A B
0 1 4.0
1 2 2.0
2 2 1.0
0 3 NaN
1 4 NaN
3 5 2.0
4 5 3.0
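
As a variant (not part of the original answer), a single left merge against the full range of A gives the same rows with a clean, ordered index:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[[1,4],[2,2],[2,1],[5,2],[5,3]], columns=['A','B'])
# full range of A; values of A missing from df get NaN in B on the left merge,
# while duplicated values of A keep all their matching rows
full = pd.DataFrame({'A': np.arange(df['A'].min(), df['A'].max() + 1)})
res = full.merge(df, on='A', how='left')
print(res)
A B
0 1 4.0
1 2 2.0
2 2 1.0
3 3 NaN
4 4 NaN
5 5 2.0
6 5 3.0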

Related

Pandas time-series groupby with custom function

I have a time series with several products. For each product I want to remove the zero rows at the extremes, and in the middle I want to replace runs of exactly two consecutive 0s with np.nan. Here is an example:
Date Id Units Should be
1 a 0 remove row
2 a 5 5
3 a 0 np.nan
4 a 0 np.nan
5 a 1 1
6 a 3 3
1 b 4 4
2 b 2 2
3 b 0 0
4 b 4 4
5 b 0 remove row
6 b 0 remove row
I tried using groupby and a for loop to get the indexes, but I wasn't able to combine the rules.
You can use:
## PART 1: remove the leading/trailing 0s
# mask of non-zero rows
m = df['Units'].ne(0)
# get masks to identify the middle values
m1 = m.groupby(df['Id']).cummax()
m2 = m[::-1].groupby(df['Id']).cummax()
# slice the "internal" rows (copy to avoid modifying a view)
out = df[m1&m2].copy()
## PART 2: replace stretches of exactly two 0s
# label consecutive runs of equal mask values within each Id
g = m.ne(m.groupby(df['Id']).shift()).cumsum()
# flag rows that belong to runs of length 2
m3 = df.groupby(['Id', g])['Units'].transform('size').eq(2)
# a "double 0" is a zero row inside a run of size 2
out.loc[m3&~m, 'Units'] = pd.NA
output:
Date Id Units Should be
1 2 a 5.0 5
2 3 a NaN np.nan
3 4 a NaN np.nan
4 5 a 1.0 1
5 6 a 3.0 3
6 1 b 4.0 4
7 2 b 2.0 2
8 3 b 0.0 0
9 4 b 4.0 4
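
The run-labelling idiom m.ne(m.shift()).cumsum() does the heavy lifting in PART 2; a minimal standalone illustration of it, without the per-Id grouping (not from the original answer):
import pandas as pd

s = pd.Series([0, 0, 5, 0, 0, 3])
m = s.ne(0)
# every switch between zero and non-zero starts a new run id
run_id = m.ne(m.shift()).cumsum()
print(run_id.tolist())  # [1, 1, 2, 3, 3, 4]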

Adding new data and new column on a pandas dataframe with zero filled

Can anyone tell me a way to add a new column and data to an existing dataframe, similar to what is shown below? When I enter a new column name and value, it should add a column with the new value in the last row and zeros in all other places, as shown below in the pandas dataframe.
DataFrame :
A B C
1 2 3
4 5 6
Enter New Column Name: D
Enter New Value: 7
New DataFrame
A B C D
1 2 3 0
4 5 6 0
0 0 0 7
You can create the dataframe to append and combine with concat:
out = pd.concat([df,pd.DataFrame({'D':[7]})]).fillna(0)
out
A B C D
0 1.0 2.0 3.0 0.0
1 4.0 5.0 6.0 0.0
0 0.0 0.0 0.0 7.0
Other solution, with .append (deprecated since pandas 1.4 and removed in 2.0, so this works only on older versions):
print(df.append({"D": 7}, ignore_index=True).fillna(0).astype(int))
Prints:
A B C D
0 1 2 3 0
1 4 5 6 0
2 0 0 0 7
We can also use .loc with .fillna():
df.loc[df.shape[0], 'D'] = 7
df = df.fillna(0, downcast='infer')  # on pandas >= 2.2 use .fillna(0).astype(int) instead
Result:
print(df)
A B C D
0 1 2 3 0
1 4 5 6 0
2 0 0 0 7
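
To match the interactive flow in the question ("Enter New Column Name" / "Enter New Value"), the concat approach can be wrapped in a small helper; a sketch, where the name add_column is my own:
import pandas as pd

def add_column(df, name, value):
    # append one row holding `value` in the new column `name`,
    # then zero-fill everything else and restore integer dtype
    return pd.concat([df, pd.DataFrame({name: [value]})],
                     ignore_index=True).fillna(0).astype(int)

df = pd.DataFrame({'A': [1, 4], 'B': [2, 5], 'C': [3, 6]})
print(add_column(df, input("Enter New Column Name: "), int(input("Enter New Value: "))))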

Pandas single match join

Is there a way to do a single "first available match" join, i.e. something inside the function 'some_magic_merge' that would create the final df shown below?
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'joincol':['a','a','b','b','b','b','d','d'],
                    'val1':[1,2,3,4,5,6,7,8]})
df2 = pd.DataFrame({'joincol':['a','a','a','b','b','d'],
                    'val2':[1,2,3,4,5,6]})
final_df = some_magic_merge(df1,df2)
print(final_df)
print(df1)
print(df2)
output final df
joincol val1 val2
0 a 1 1.0
1 a 2 2.0
2 b 3 4.0
3 b 4 5.0
4 b 5 NaN
5 b 6 NaN
6 d 7 6.0
7 d 8 NaN
output df1 and df2
joincol val1
0 a 1
1 a 2
2 b 3
3 b 4
4 b 5
5 b 6
6 d 7
7 d 8
joincol val2
0 a 1
1 a 2
2 a 3
3 b 4
4 b 5
5 d 6
Use GroupBy.cumcount to build a helper column g filled with the per-key occurrence counter, then left join on both columns in merge:
final_df = pd.merge(df1.assign(g=df1.groupby('joincol').cumcount()),
                    df2.assign(g=df2.groupby('joincol').cumcount()),
                    how='left', on=['joincol','g']).drop('g', axis=1)
print(final_df)
joincol val1 val2
0 a 1 1.0
1 a 2 2.0
2 b 3 4.0
3 b 4 5.0
4 b 5 NaN
5 b 6 NaN
6 d 7 6.0
7 d 8 NaN
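
The helper key is simply the 0-based position of each row within its joincol group, so the n-th 'a' in df1 can only match the n-th 'a' in df2; a quick look at the key (not part of the original answer):
import pandas as pd

df1 = pd.DataFrame({'joincol': ['a','a','b','b','b','b','d','d'],
                    'val1': [1,2,3,4,5,6,7,8]})
print(df1.groupby('joincol').cumcount().tolist())
# [0, 1, 0, 1, 2, 3, 0, 1]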

How to fill values based on data present in column and an array? Pandas

Let's say I have a dataframe with NaNs in each group, like
df = pd.DataFrame({'data':[0,1,2,0,np.nan,2,np.nan,0,1],'group':[1,1,1,2,2,2,3,3,3]})
and a numpy array like
x = np.array([0,1,2])
Now, based on the groups, how can I fill the missing values with the values from the numpy array, i.e.
df = pd.DataFrame({'data':[0,1,2,0,1,2,2,0,1],'group':[1,1,1,2,2,2,3,3,3]})
data group
0 0 1
1 1 1
2 2 1
3 0 2
4 1 2
5 2 2
6 2 3
7 0 3
8 1 3
Let me explain a bit of how the data should be filled. Consider group 2. The values of data are 0, np.nan, 2. The np.nan is the missing value from the array [0,1,2], so the value to fill in place of the nan is 1.
For multiple nan values, take for example a group that has data [np.nan, 0, np.nan]; the values to be filled in place of the nans are 1 and 2, resulting in [1, 0, 2].
First find the value which is missing and then pass it to fillna:
def f(y):
    a = list(set(x) - set(y))
    a = 1 if len(a) == 0 else a[0]
    y = y.fillna(a)
    return y
df['data'] = df.groupby('group')['data'].apply(f).astype(int)
print (df)
data group
0 0 1
1 1 1
2 2 1
3 0 2
4 1 2
5 2 2
6 2 3
7 0 3
8 1 3
EDIT:
df = pd.DataFrame({'data':[0,1,2,0,np.nan,2,np.nan,np.nan,1, np.nan, np.nan, np.nan],
                   'group':[1,1,1,2,2,2,3,3,3,4,4,4]})
x = np.array([0,1,2])
print (df)
data group
0 0.0 1
1 1.0 1
2 2.0 1
3 0.0 2
4 NaN 2
5 2.0 2
6 NaN 3
7 NaN 3
8 1.0 3
9 NaN 4
10 NaN 4
11 NaN 4
def f(y):
    a = list(set(x) - set(y))
    if len(a) == 1:
        return y.fillna(a[0])
    elif len(a) == 2:
        return y.fillna(a[0], limit=1).fillna(a[1])
    elif len(a) == 3:
        return pd.Series(x, index=y.index)
    else:
        return y
df['data'] = df.groupby('group')['data'].apply(f).astype(int)
print (df)
data group
0 0 1
1 1 1
2 2 1
3 0 2
4 1 2
5 2 2
6 0 3
7 2 3
8 1 3
9 0 4
10 1 4
11 2 4
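
The branching above only covers up to three missing values. A generalised variant (my own sketch, not from the answer) fills the NaNs of each group with the absent values of x in ascending order, assuming a group never has more NaNs than x has absent values:
import numpy as np
import pandas as pd

def f(y):
    # values of x absent from this group, in ascending order
    missing = sorted(set(x) - set(y.dropna()))
    y = y.copy()
    n = int(y.isna().sum())
    if n:  # assumes n never exceeds len(missing)
        y[y.isna()] = missing[:n]
    return y

df['data'] = df.groupby('group')['data'].transform(f).astype(int)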

Find data from row in Pandas DataFrame based upon calculated value?

As an extension of my previous question, I would like to take a DataFrame like the one below and find the correct row from which to pull data from column C and place it into column D, based upon the following criteria:
B_new = 2*A_old - B_old, i.e. the new row needs to have a B equal to the following result from the old row: 2*A - B.
Where A is the same, i.e. A in the new row should have the same value as in the old row.
Any values not found should result in NaN.
Code:
import pandas as pd
a = [2,2,2,3,3,3,3]
b = [1,2,3,1,3,4,5]
c = [0,1,2,3,4,5,6]
df = pd.DataFrame({'A': a , 'B': b, 'C':c})
print(df)
A B C
0 2 1 0
1 2 2 1
2 2 3 2
3 3 1 3
4 3 3 4
5 3 4 5
6 3 5 6
Desired output:
A B C D
0 2 1 0 2.0
1 2 2 1 1.0
2 2 3 2 0.0
3 3 1 3 6.0
4 3 3 4 4.0
5 3 4 5 NaN
6 3 5 6 3.0
Based upon the solutions to my previous question, I've come up with a method that uses a for loop to move through each unique value of A:
for i in df.A.unique():
    mapping = dict(df[df.A==i][['B', 'C']].values)
    df.loc[df.A==i, 'D'] = (2 * df[df.A==i]['A'] - df[df.A==i]['B']).map(mapping)
However, this seems clunky, and I suspect there is a better way that doesn't make use of for loops, which from my prior experience tend to be slow.
Question:
What's the fastest way to accomplish this transfer of data within the DataFrame?
You could use a self-merge on the computed key:
In [370]: (df[['A', 'C']].assign(B=2*df.A - df.B)
             .merge(df, how='left', on=['A', 'B'])
             .assign(B=df.B)
             .rename(columns={'C_x': 'C', 'C_y': 'D'}))
Out[370]:
A C B D
0 2 0 1 2.0
1 2 1 2 1.0
2 2 2 3 0.0
3 3 3 1 6.0
4 3 4 3 4.0
5 3 5 4 NaN
6 3 6 5 3.0
Details:
In [372]: df[['A', 'C']].assign(B=2*df.A - df.B)
Out[372]:
A C B
0 2 0 3
1 2 1 2
2 2 2 1
3 3 3 5
4 3 4 3
5 3 5 2
6 3 6 1
In [373]: df[['A', 'C']].assign(B=2*df.A - df.B).merge(df, how='left', on=['A', 'B'])
Out[373]:
A C_x B C_y
0 2 0 3 2.0
1 2 1 2 1.0
2 2 2 1 0.0
3 3 3 5 6.0
4 3 4 3 4.0
5 3 5 2 NaN
6 3 6 1 3.0
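
An equivalent merge-free formulation (my own sketch, not from the original answer) builds an (A, B) -> C lookup and reindexes it at the computed key (A, 2*A - B); keys without a match come back as NaN:
import pandas as pd

df = pd.DataFrame({'A': [2,2,2,3,3,3,3],
                   'B': [1,2,3,1,3,4,5],
                   'C': [0,1,2,3,4,5,6]})
# (A, B) -> C lookup, queried at (A, 2*A - B)
lookup = df.set_index(['A', 'B'])['C']
key = pd.MultiIndex.from_arrays([df['A'], 2*df['A'] - df['B']])
df['D'] = lookup.reindex(key).to_numpy()
print(df)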
