I have a dataframe like:
import pandas as pd
df2 = pd.DataFrame(data=[[1,4],[2,2],[2,1],[5,2],[5,3]], columns=['A','B'])
df2
Out[117]:
   A  B
0  1  4
1  2  2
2  2  1
3  5  2
4  5  3
and I would like to insert a row with NaN in column B for every integer value that is missing from column A. The dataframe should become:
df2
   A    B
0  1    4
1  2    2
2  2    1
3  3  NaN
4  4  NaN
5  5    2
6  5    3
Could you please help me?
You can construct a dataframe of the missing rows, concatenate, then sort:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[[1,4],[2,2],[2,1],[5,2],[5,3]], columns=['A','B'])

# construct a dataframe holding the values of A that are missing
arr = np.arange(df['A'].min(), df['A'].max() + 1)
arr = arr[~np.isin(arr, df['A'].values)]
df_append = pd.DataFrame({'A': arr})

# concatenate and sort
res = pd.concat([df, df_append]).sort_values('A')
print(res)
   A    B
0  1  4.0
1  2  2.0
2  2  1.0
0  3  NaN
1  4  NaN
3  5  2.0
4  5  3.0
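A more compact variant of the same idea, as a sketch using Index.difference (behavior should be identical on recent pandas versions):
# integers in the full range of A that are absent from the column
missing = pd.RangeIndex(df['A'].min(), df['A'].max() + 1).difference(df['A'])
res = pd.concat([df, pd.DataFrame({'A': missing})]).sort_values('A', ignore_index=True)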
I have a time series with several products. For each product I want to remove the zero rows at the extremes, and in the middle I want to replace runs of exactly two consecutive 0s with np.nan. Here is an example:
Date Id  Units  Should be
1    a   0      remove row
2    a   5      5
3    a   0      np.nan
4    a   0      np.nan
5    a   1      1
6    a   3      3
1    b   4      4
2    b   2      2
3    b   0      0
4    b   4      4
5    b   0      remove row
6    b   0      remove row
I tried using groupby and a for loop to get the indexes, but I wasn't able to combine the rules.
You can use:
## PART 1: remove the leading/trailing 0s
# mark the rows whose Units is non-zero
m = df['Units'].ne(0)
# masks that identify the "internal" rows, i.e. between the first
# and the last non-zero row of each Id
m1 = m.groupby(df['Id']).cummax()
m2 = m[::-1].groupby(df['Id']).cummax()
# slice the internal rows
out = df[m1 & m2].copy()

## PART 2: replace stretches of exactly two 0s
# label each stretch of consecutive equal values within an Id
g = m.ne(m.groupby(df['Id']).shift()).cumsum()
# mark the stretches of length 2
m3 = df.groupby(['Id', g])['Units'].transform('size').eq(2)
# zeros belonging to a stretch of two become NaN
out.loc[(m3 & ~m)[out.index], 'Units'] = np.nan
Output:
   Date Id  Units   Should be
1     2  a    5.0           5
2     3  a    NaN      np.nan
3     4  a    NaN      np.nan
4     5  a    1.0           1
5     6  a    3.0           3
6     1  b    4.0           4
7     2  b    2.0           2
8     3  b    0.0           0
9     4  b    4.0           4
Note that the single 0 for Id b (Date 3) is kept as-is; only the run of two 0s for Id a becomes NaN, matching the "Should be" column.
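For reference, a minimal setup reproducing the example table (the 'Should be' column from the question is included only so the printed output matches; it plays no role in the logic):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
                   'Id': list('aaaaaabbbbbb'),
                   'Units': [0, 5, 0, 0, 1, 3, 4, 2, 0, 4, 0, 0],
                   'Should be': ['remove row', '5', 'np.nan', 'np.nan', '1', '3',
                                 '4', '2', '0', '4', 'remove row', 'remove row']})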
Can anyone tell me a way to add a new column and data to an existing dataframe, similar to what is shown below? When I enter a new column name and value, it should append a row holding the new value in the new column, with zeroes everywhere else, as shown below in the pandas dataframe.
DataFrame:
A  B  C
1  2  3
4  5  6
Enter New Column Name: D
Enter New Value: 7
New DataFrame:
A  B  C  D
1  2  3  0
4  5  6  0
0  0  0  7
You can create the row to append as its own dataframe and combine with concat:
out = pd.concat([df,pd.DataFrame({'D':[7]})]).fillna(0)
out
     A    B    C    D
0  1.0  2.0  3.0  0.0
1  4.0  5.0  6.0  0.0
0  0.0  0.0  0.0  7.0
Other solution, with .append (note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so this only works on older versions):
print(df.append({"D": 7}, ignore_index=True).fillna(0).astype(int))
Prints:
   A  B  C  D
0  1  2  3  0
1  4  5  6  0
2  0  0  0  7
We can also use .loc with .fillna() (the downcast keyword is deprecated in recent pandas; there you can use .fillna(0).astype(int) instead):
df.loc[df.shape[0], 'D'] = 7
df = df.fillna(0, downcast='infer')
Result:
print(df)
   A  B  C  D
0  1  2  3  0
1  4  5  6  0
2  0  0  0  7
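If you need this as a reusable step, here is a hypothetical helper wrapping the concat approach (the name add_column and the cast back to int are mine, not from the answers above):
def add_column(df, name, value):
    # append a one-row frame holding only the new column, then zero-fill
    out = pd.concat([df, pd.DataFrame({name: [value]})], ignore_index=True)
    # assumes all columns are numeric, as in the example
    return out.fillna(0).astype(int)

print(add_column(df, 'D', 7))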
Is there a way to do a "first available match" join, i.e. something that will create the final df inside the function 'some_magic_merge'?
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'joincol': ['a','a','b','b','b','b','d','d'],
                    'val1': [1,2,3,4,5,6,7,8]})
df2 = pd.DataFrame({'joincol': ['a','a','a','b','b','d'],
                    'val2': [1,2,3,4,5,6]})

final_df = some_magic_merge(df1, df2)
print(final_df)
print(df1)
print(df2)
Output final_df:
  joincol  val1  val2
0       a     1   1.0
1       a     2   2.0
2       b     3   4.0
3       b     4   5.0
4       b     5   NaN
5       b     6   NaN
6       d     7   6.0
7       d     8   NaN
Output df1 and df2:
  joincol  val1
0       a     1
1       a     2
2       b     3
3       b     4
4       b     5
5       b     6
6       d     7
7       d     8
  joincol  val2
0       a     1
1       a     2
2       a     3
3       b     4
4       b     5
5       d     6
Use GroupBy.cumcount to build a helper column holding a per-key counter, then left join with merge:
final_df = pd.merge(df1.assign(g=df1.groupby('joincol').cumcount()),
                    df2.assign(g=df2.groupby('joincol').cumcount()),
                    how='left', on=['joincol', 'g']).drop('g', axis=1)
print(final_df)
  joincol  val1  val2
0       a     1   1.0
1       a     2   2.0
2       b     3   4.0
3       b     4   5.0
4       b     5   NaN
5       b     6   NaN
6       d     7   6.0
7       d     8   NaN
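To see why this gives a "first available match" join, look at the helper column on df1 (the same numbering is applied to df2, so the n-th 'b' of df1 can only match the n-th 'b' of df2):
print(df1.assign(g=df1.groupby('joincol').cumcount()))
  joincol  val1  g
0       a     1  0
1       a     2  1
2       b     3  0
3       b     4  1
4       b     5  2
5       b     6  3
6       d     7  0
7       d     8  1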
Let's say I have a dataframe with NaNs in some of the groups, like
df = pd.DataFrame({'data':[0,1,2,0,np.nan,2,np.nan,0,1],'group':[1,1,1,2,2,2,3,3,3]})
and a numpy array like
x = np.array([0,1,2])
Now, based on the groups, how do I fill the missing values using the values in the numpy array, i.e.
df = pd.DataFrame({'data':[0,1,2,0,1,2,2,0,1],'group':[1,1,1,2,2,2,3,3,3]})
   data  group
0     0      1
1     1      1
2     2      1
3     0      2
4     1      2
5     2      2
6     2      3
7     0      3
8     1      3
Let me explain how the data should be filled. Consider group 2: its data values are 0, np.nan, 2. The np.nan stands in for the value missing from the array [0, 1, 2], so the value to fill in place of the NaN is 1.
For multiple NaN values, take for example a group with data [np.nan, 0, np.nan]: the values to fill in place of the NaNs are 1 and 2, resulting in [1, 0, 2].
First find the value that is missing, then pass it to fillna:
def f(y):
    # values of x absent from this group, in sorted order
    a = sorted(set(x) - set(y))
    # if nothing is missing there are no NaNs to fill, so any filler works
    a = 1 if len(a) == 0 else a[0]
    return y.fillna(a)

df['data'] = df.groupby('group')['data'].apply(f).astype(int)
print(df)
   data  group
0     0      1
1     1      1
2     2      1
3     0      2
4     1      2
5     2      2
6     2      3
7     0      3
8     1      3
EDIT:
df = pd.DataFrame({'data': [0, 1, 2, 0, np.nan, 2, np.nan, np.nan, 1,
                            np.nan, np.nan, np.nan],
                   'group': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]})
x = np.array([0, 1, 2])
print(df)
    data  group
0    0.0      1
1    1.0      1
2    2.0      1
3    0.0      2
4    NaN      2
5    2.0      2
6    NaN      3
7    NaN      3
8    1.0      3
9    NaN      4
10   NaN      4
11   NaN      4
def f(y):
    a = sorted(set(x) - set(y))
    if len(a) == 1:
        # one value missing: fill every NaN with it
        return y.fillna(a[0])
    elif len(a) == 2:
        # two values missing: the first NaN gets one, the rest the other
        return y.fillna(a[0], limit=1).fillna(a[1])
    elif len(a) == 3:
        # every value missing: replace the whole group with x
        return pd.Series(x, index=y.index)
    else:
        return y

df['data'] = df.groupby('group')['data'].apply(f).astype(int)
print(df)
    data  group
0      0      1
1      1      1
2      2      1
3      0      2
4      1      2
5      2      2
6      0      3
7      2      3
8      1      3
9      0      4
10     1      4
11     2      4
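A more general variant, as a sketch of my own rather than part of the answer above: fill each group's NaNs with the absent values of x in sorted order, whatever their number. It assumes, as in the examples, that a group never has more NaNs than missing values:
def fill_missing(y):
    # values of x not present in this group, smallest first
    missing = iter(sorted(set(x) - set(y.dropna())))
    # replace each NaN, in order of appearance, with the next missing value
    return y.map(lambda v: next(missing) if pd.isna(v) else v)

df['data'] = df.groupby('group')['data'].apply(fill_missing).astype(int)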
As an extension of my previous question, I would like to take a DataFrame like the one below and find the correct row from which to pull data from column C and place it into column D, based upon the following criteria:
B_new = 2*A_old - B_old, i.e. the new row needs to have a B equal to the following result from the old row: 2*A - B.
A stays the same, i.e. A in the new row should have the same value as in the old row.
Any values not found should result in NaN.
Code:
import pandas as pd

a = [2, 2, 2, 3, 3, 3, 3]
b = [1, 2, 3, 1, 3, 4, 5]
c = [0, 1, 2, 3, 4, 5, 6]
df = pd.DataFrame({'A': a, 'B': b, 'C': c})
print(df)
   A  B  C
0  2  1  0
1  2  2  1
2  2  3  2
3  3  1  3
4  3  3  4
5  3  4  5
6  3  5  6
Desired output:
   A  B  C    D
0  2  1  0  2.0
1  2  2  1  1.0
2  2  3  2  0.0
3  3  1  3  6.0
4  3  3  4  4.0
5  3  4  5  NaN
6  3  5  6  3.0
Based upon the solutions to my previous question, I've come up with a method that uses a for loop to move through each unique value of A:
for i in df.A.unique():
    mapping = dict(df[df.A == i][['B', 'C']].values)
    df.loc[df.A == i, 'D'] = (2 * df[df.A == i]['A'] - df[df.A == i]['B']).map(mapping)
However, this seems clunky, and I suspect there is a better way that doesn't rely on for loops, which in my experience tend to be slow.
Question:
What's the fastest way to accomplish this transfer of data within the DataFrame?
You could:
In [370]: (df[['A', 'C']].assign(B=2*df.A - df.B)
             .merge(df, how='left', on=['A', 'B'])
             .assign(B=df.B)
             .rename(columns={'C_x': 'C', 'C_y': 'D'}))
Out[370]:
   A  C  B    D
0  2  0  1  2.0
1  2  1  2  1.0
2  2  2  3  0.0
3  3  3  1  6.0
4  3  4  3  4.0
5  3  5  4  NaN
6  3  6  5  3.0
Details:
In [372]: df[['A', 'C']].assign(B=2*df.A - df.B)
Out[372]:
A C B
0 2 0 3
1 2 1 2
2 2 2 1
3 3 3 5
4 3 4 3
5 3 5 2
6 3 6 1
In [373]: df[['A', 'C']].assign(B=2*df.A - df.B).merge(df, how='left', on=['A', 'B'])
Out[373]:
A C_x B C_y
0 2 0 3 2.0
1 2 1 2 1.0
2 2 2 1 0.0
3 3 3 5 6.0
4 3 4 3 4.0
5 3 5 2 NaN
6 3 6 1 3.0
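An alternative sketch of my own, not from the answer above: build an (A, B) -> C lookup with set_index and read off 2*A - B directly. This assumes the (A, B) pairs are unique, as in the example:
# Series mapping each existing (A, B) pair to its C value
lookup = df.set_index(['A', 'B'])['C']
# the (A, 2*A - B) pairs we want to look up, row by row
target = pd.MultiIndex.from_arrays([df['A'], 2 * df['A'] - df['B']])
# pairs absent from the lookup come back as NaN
df['D'] = lookup.reindex(target).to_numpy()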