I need to subtract two DataFrames with different indexes (which produces NaN wherever one of the values is missing), and I want to replace the missing values from each DataFrame with a different number (fill value).
For example, let's say I have df1 and df2:
df1:
A B C
0 0 3 0
1 0 0 4
2 4 0 2
df2:
A B C
0 0 3 0
1 1 2 0
3 1 2 0
subtracted = df1.sub(df2):
A B C
0 0 0 0
1 -1 -2 4
2 NaN NaN NaN
3 NaN NaN NaN
I want the row with index 2 of subtracted to keep the values from row 2 of df1, and the row with index 3 of subtracted to have the value 5.
I expect -
subtracted:
A B C
0 0 0 0
1 -1 -2 4
2 4 0 2
3 5 5 5
I tried using the sub method with fill_value=5, but then the same fill value is applied to both DataFrames, so rows 2 and 3 do not come out the way I want.
One way would be to reindex df2 setting fill_value to 0 before subtracting, then subtract and fillna with 5:
ix = pd.RangeIndex(df1.index.union(df2.index).max() + 1)
df1.sub(df2.reindex(ix, fill_value=0)).fillna(5).astype(df1.dtypes)
A B C
0 0 0 0
1 -1 -2 4
2 4 0 2
3 5 5 5
We have to reindex here to get aligned indices. Once both frames share the same index, we can use the sub method.
idxmin = df2.index.min()
idxmax = df2.index.max()
idx = np.arange(idxmin, idxmax+1)
df1.reindex(idx).sub(df2.reindex(idx).fillna(0)).fillna(5)
A B C
0 0.0 0.0 0.0
1 -1.0 -2.0 4.0
2 4.0 0.0 2.0
3 5.0 5.0 5.0
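The reindex-based approach above can also be written against the union of both indexes instead of a contiguous range. This is only a minimal sketch, assuming the example df1 and df2 from the question; the names full_index and subtracted are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [0, 0, 4], "B": [3, 0, 0], "C": [0, 4, 2]}, index=[0, 1, 2])
df2 = pd.DataFrame({"A": [0, 1, 1], "B": [3, 2, 2], "C": [0, 0, 0]}, index=[0, 1, 3])

# Every row label that appears in either frame.
full_index = df1.index.union(df2.index)

# Rows missing from df2 are filled with 0, so df1's values pass through unchanged.
# Rows missing from df1 stay NaN after the subtraction and are then replaced by 5.
subtracted = df1.sub(df2.reindex(full_index, fill_value=0)).fillna(5).astype(int)
print(subtracted)
```

The two fill values are applied at different stages (0 before subtracting, 5 after), which is why a single fill_value argument to sub is not enough here.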
I found the combine_first method that almost satisfies my needs:
df2.combine_first(df1).sub(df2, fill_value=0)
but it still produces only:
A B C
0 0 0 0
1 0 0 0
2 4 0 2
3 0 0 0
I have a DataFrame like this:
col1
0 1
1 3
2 3
3 1
4 2
5 3
6 2
7 2
I want to create a column out by comparing each row with the next one: if a row's value is less than the next row's value, out is 1; otherwise out is 0, as in this sample:
col1 out
0 1 1 # 1<3 = 1
1 3 0 # 3<3 = 0
2 3 0 # 3<1 = 0
3 1 1 # 1<2 = 1
4 2 1 # 2<3 = 1
5 3 0 # 3<2 = 0
6 2 0 # 2<2 = 0
7 2 -
I tried this code:
def comp_out(a):
    return np.concatenate(([1],a[1:] > a[2:]))
df['out'] = comp_out(df.col1.values)
It raises this error:
ValueError: operands could not be broadcast together with shapes (11,) (10,)
Let's use shift instead to shift the column up by one, so that each row is aligned with the value of the next row, then use lt for the less-than comparison and astype to convert the booleans to 1/0:
df['out'] = df['col1'].lt(df['col1'].shift(-1)).astype(int)
col1 out
0 1 1
1 3 0
2 3 0
3 1 1
4 2 1
5 3 0
6 2 0
7 2 0
We can strip the last value with iloc if needed (the missing last row then becomes NaN, which turns the column into floats):
df['out'] = df['col1'].lt(df['col1'].shift(-1)).iloc[:-1].astype(int)
df:
col1 out
0 1 1.0
1 3 0.0
2 3 0.0
3 1 1.0
4 2 1.0
5 3 0.0
6 2 0.0
7 2 NaN
If we want to use the function, we should make sure both slices are the same length, by slicing off the last value of one and the first value of the other:
def comp_out(a):
    return np.concatenate([a[0:-1] < a[1:], [np.nan]])
df['out'] = comp_out(df['col1'].to_numpy())
df:
col1 out
0 1 1.0
1 3 0.0
2 3 0.0
3 1 1.0
4 2 1.0
5 3 0.0
6 2 0.0
7 2 NaN
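The error in the question comes from comparing a[1:] with a[2:], two slices of different lengths. If the last row should stay empty while the rest remain integers, one option is pandas' nullable Int64 dtype. This is a minimal sketch, assuming the col1 data from the question.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 3, 3, 1, 2, 3, 2, 2]})

# Compare each value with the next one; shift(-1) moves the column up by one row.
out = df["col1"].lt(df["col1"].shift(-1)).astype("Int64")

# The last row has no successor, so mark it as missing instead of 0.
out.iloc[-1] = pd.NA

df["out"] = out
print(df)
```

Unlike the float column produced by the iloc variant, Int64 keeps 1/0 as integers alongside the missing value.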
Can anyone tell me a way to add a new column and a new row of data to an existing DataFrame, similar to what is shown below? When I enter a new column name and value, it should add a column with the new value in the last row and zeros everywhere else, as shown below in a pandas DataFrame.
DataFrame :
A B C
1 2 3
4 5 6
Enter New Column Name: D
Enter New Value: 7
New DataFrame
A B C D
1 2 3 0
4 5 6 0
0 0 0 7
You can build the new row as its own DataFrame and append it with concat:
out = pd.concat([df,pd.DataFrame({'D':[7]})]).fillna(0)
out
A B C D
0 1.0 2.0 3.0 0.0
1 4.0 5.0 6.0 0.0
0 0.0 0.0 0.0 7.0
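Another solution, with .append (note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so this only works on older versions):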
print(df.append({"D": 7}, ignore_index=True).fillna(0).astype(int))
Prints:
A B C D
0 1 2 3 0
1 4 5 6 0
2 0 0 0 7
We can also use .loc with .fillna():
df.loc[df.shape[0], 'D'] = 7
df = df.fillna(0, downcast='infer')
Result:
print(df)
A B C D
0 1 2 3 0
1 4 5 6 0
2 0 0 0 7
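The concat approach can be wrapped in a small helper that also renumbers the rows and restores integer dtypes. This is a minimal sketch; the function name add_column_with_value is hypothetical, and the astype(int) step assumes all data is integer-valued as in the example.

```python
import pandas as pd

def add_column_with_value(df, name, value):
    """Append one row holding `value` in a new column `name`; fill the gaps with 0."""
    new_row = pd.DataFrame({name: [value]})
    # ignore_index=True renumbers the rows 0..n instead of repeating label 0.
    combined = pd.concat([df, new_row], ignore_index=True)
    # Cells that exist in neither input are NaN after concat; replace them with 0.
    return combined.fillna(0).astype(int)

df = pd.DataFrame({"A": [1, 4], "B": [2, 5], "C": [3, 6]})
out = add_column_with_value(df, "D", 7)
print(out)
```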
I've got a df ('c' always starts with 0 and 'd' always starts with 1 in each group):
a b c d
0 1 0 1
0 1 0 1
0 2 0 1
0 2 0 0
1 3 0 1
1 3 1 0
1 2 0 1
1 2 0 0
The groups (grouped by 'a' and 'b') in columns 'c' and 'd' are the alternatives to pick from.
The end value of each group should alternate with the first value of the next group (the transition between subgroups doesn't matter).
For the first group in a subgroup ('a'), pick column 'c'.
Result should be:
a b c d
0 1 0 nan
0 1 0 nan
0 2 nan 1
0 2 nan 0
1 3 0 nan
1 3 1 nan
1 2 0 nan
1 2 0 nan
Any ideas? Maybe something like: pick the group in column 'c', and if its values are both 0, the next group in 'c' becomes nan, nan.
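One possible reading of the rules above: within each 'a', the first group keeps 'c'; every later group keeps whichever column starts with the opposite of the previous group's last kept value ('c' starts with 0, 'd' starts with 1). The loop below is only a sketch under that interpretation, and the names result, keep, and prev_end are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [0, 0, 0, 0, 1, 1, 1, 1],
    "b": [1, 1, 2, 2, 3, 3, 2, 2],
    "c": [0, 0, 0, 0, 0, 1, 0, 0],
    "d": [1, 1, 1, 0, 1, 0, 1, 0],
})

# Work on a float copy so the discarded alternative can be set to NaN.
result = df.astype({"c": float, "d": float})

for _, sub in df.groupby("a", sort=False):
    prev_end = None  # reset at the start of every 'a' subgroup
    for _, grp in sub.groupby("b", sort=False):
        # First group keeps 'c'. Afterwards: previous end 1 -> need a start of 0 ('c'),
        # previous end 0 -> need a start of 1 ('d').
        keep = "c" if prev_end is None or prev_end == 1 else "d"
        drop = "d" if keep == "c" else "c"
        result.loc[grp.index, drop] = np.nan
        prev_end = int(grp[keep].iloc[-1])

print(result)
```

On the sample data this reproduces the expected result shown in the question; other inputs may need the rule adjusted.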
Here is my question. I don't know how to describe it, so I will just give an example.
a b k
0 0 0
0 1 1
0 2 0
0 3 0
0 4 1
0 5 0
1 0 0
1 1 1
1 2 0
1 3 1
1 4 0
Here, "a" is the user id, "b" is the time, and "k" is a binary indicator flag. The values of "b" are guaranteed to be consecutive.
What I want to get is this:
a b k diff_b
0 0 0 nan
0 1 1 nan
0 2 0 1
0 3 0 2
0 4 1 3
0 5 0 1
1 0 0 nan
1 1 1 nan
1 2 0 1
1 3 1 2
1 4 0 1
So, diff_b is a time difference variable. It shows the duration between the current time point and the last time point with an action. If there is never an action before, it returns nan. This diff_b is grouped by a. For each user, this diff_b is calculated independently.
Can anyone revise my title? I don't know how to describe it in English. It's so complex...
Thank you!
IIUC
df['New'] = df.b.loc[df.k == 1]  # keep the value of b only where k equals 1
df['New'] = df.groupby('a')['New'].transform(lambda x: x.ffill().shift())  # forward-fill within each user, then shift down one row
df.b - df['New']  # yields
Out[260]:
0 NaN
1 NaN
2 1.0
3 2.0
4 3.0
5 1.0
6 NaN
7 NaN
8 1.0
9 2.0
10 1.0
dtype: float64
Create partitions of the rows that follow a k == 1 up to the next k == 1, using shift and cumsum within each group of a:
parts = df.groupby('a').k.apply(lambda x: x.shift().cumsum())
Group by df.a and parts, and calculate the difference between b and b.min() within each group (plus 1):
vals = df.groupby([df.a, parts]).b.apply(lambda x: x-x.min()+1)
Set the values to null where parts == 0 and assign the result back to the DataFrame:
df['diff_b'] = np.select([parts!=0], [vals], np.nan)
outputs:
a b k diff_b
0 0 0 0 NaN
1 0 1 1 NaN
2 0 2 0 1.0
3 0 3 0 2.0
4 0 4 1 3.0
5 0 5 0 1.0
6 1 0 0 NaN
7 1 1 1 NaN
8 1 2 0 1.0
9 1 3 1 2.0
10 1 4 0 1.0
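The partition idea above can also be sketched with transform, which keeps the original row index on current pandas versions. This is a minimal sketch, assuming the example data from the question; shift(fill_value=0) is used here so that parts contains no NaN keys, and where replaces np.select.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "b": [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4],
    "k": [0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0],
})

# Partition number per user: it increases on the row after each k == 1.
parts = df.groupby("a")["k"].transform(lambda x: x.shift(fill_value=0).cumsum())

# Offset of each row inside its (user, partition) block, counted from 1.
vals = df.groupby([df["a"], parts])["b"].transform(lambda x: x - x.min() + 1)

# Partition 0 means no action has happened yet for that user, so leave NaN there.
df["diff_b"] = vals.where(parts != 0)
print(df)
```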
I am trying to convert a data set with 100,000 rows and 3 columns into a pivot table. While the following code runs without an error, the values are displayed as NaN.
df1 = pd.pivot_table(df_TEST, values='actions', index=['sku'], columns=['user'])
It is not picking up the values (which range from 1 to 36) from the DataFrame. Has anyone come across this situation?
This can happen when you pivot, since not every index/column combination may be present in the data. For example:
In [10]: df_TEST
Out[10]:
a b c
0 0 0 0
1 0 1 0
2 0 2 0
3 1 1 1
4 1 2 3
5 1 4 5
Now, when you do pivot on this,
In [9]: df_TEST.pivot_table(index='a', values='c', columns='b')
Out[9]:
b 0 1 2 4
a
0 0 0 0 NaN
1 NaN 1 3 5
Note that you get NaN at index 0 and column 4, since there is no entry in df_TEST with a = 0 and b = 4.
Typically you fill such values with zeros.
In [11]: df_TEST.pivot_table(index='a', values='c', columns='b').fillna(0)
Out[11]:
b 0 1 2 4
a
0 0 0 0 0
1 0 1 3 5
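pivot_table also accepts a fill_value argument, which fills the missing combinations in the same call. This is a minimal sketch, assuming the small df_TEST example above with columns a, b, and c.

```python
import pandas as pd

df_TEST = pd.DataFrame({
    "a": [0, 0, 0, 1, 1, 1],
    "b": [0, 1, 2, 1, 2, 4],
    "c": [0, 0, 0, 1, 3, 5],
})

# Combinations of a and b that never occur in the data get 0 instead of NaN.
out = df_TEST.pivot_table(index="a", values="c", columns="b", fill_value=0)
print(out)
```

Whether 0 is the right filler depends on the data: it makes a missing combination indistinguishable from a real value of 0.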