Given the following dataframe:
import pandas as pd

df = pd.DataFrame(data={'name': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c'],
                        'lag': [1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
                        'value': range(10)})
print(df)
lag name value
0 1 a 0
1 1 a 1
2 1 a 2
3 2 b 3
4 2 b 4
5 2 b 5
6 2 b 6
7 2 c 7
8 2 c 8
9 2 c 9
I am trying to shift the values in column value to obtain the column expected_value, i.e. value shifted within each name group by that group's lag rows. I was thinking of using something like df['expected_value'] = df.groupby(['name', 'lag']).shift(), but I am not sure how to pass lag to the shift() function.
print(df)
lag name value expected_value
0 1 a 0 NaN
1 1 a 1 0.0
2 1 a 2 1.0
3 2 b 3 NaN
4 2 b 4 NaN
5 2 b 5 3.0
6 2 b 6 4.0
7 2 c 7 NaN
8 2 c 8 NaN
9 2 c 9 7.0
You can use GroupBy.transform here. Inside transform, pandas sets each group's .name attribute to its group key; since we group by two columns, that key is a (name, lag) tuple, so x.name[1] is the group's lag:
df.assign(expected_value=df.groupby(['name', 'lag'])['value']
                           .transform(lambda x: x.shift(x.name[1])))
name lag value expected_value
0 a 1 0 NaN
1 a 1 1 0.0
2 a 1 2 1.0
3 b 2 3 NaN
4 b 2 4 NaN
5 b 2 5 3.0
6 b 2 6 4.0
7 c 2 7 NaN
8 c 2 8 NaN
9 c 2 9 7.0
You can do it with an apply, taking the lag from the first row of each group:
df['new_val'] = (df.groupby('name')
                   .apply(lambda x: x['value'].shift(x['lag'].iloc[0]))
                   .reset_index('name', drop=True)
                 )
Output:
name lag value new_val
0 a 1 0 NaN
1 a 1 1 0.0
2 a 1 2 1.0
3 b 2 3 NaN
4 b 2 4 NaN
5 b 2 5 3.0
6 b 2 6 4.0
7 c 2 7 NaN
8 c 2 8 NaN
9 c 2 9 7.0
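Since lag is constant within each name group in this data, a third option is a plain loop over the distinct lag values with an ordinary grouped shift per slice. A minimal sketch under that assumption (not from either answer above):
import numpy as np

df['expected_value'] = np.nan
for k in df['lag'].unique():
    mask = df['lag'].eq(k)                              # rows sharing this lag value
    df.loc[mask, 'expected_value'] = (
        df.loc[mask].groupby('name')['value'].shift(k)  # ordinary group shift by a scalar
    )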
Below is a toy Pandas dataframe that has three columns: 'id' (the group id), 'b' (the condition column), and 'c' (the target column):
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
                   'b': [3,4,5,'A',3,4,'A',1,'A',1,3,'A',2,3],
                   'c': [1,0,1,10,1,1,20,1,10,0,1,20,1,1]})
print(df)
id b c
0 1 3 1
1 1 4 0
2 1 5 1
3 1 A 10
4 1 3 1
5 1 4 1
6 1 A 20
7 2 1 1
8 2 A 10
9 2 1 0
10 2 3 1
11 2 A 20
12 2 2 1
13 2 3 1
For each group, I want to replace the values in column 'c' with nan (i.e., np.nan) before the first occurrence of 'A' in column 'b'.
The desired output is the following:
desired_output_df = pd.DataFrame({'id' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
'b' : [3,4,5,'A',3,4,'A',1,'A',1,3,'A',2,3],
'c' : [np.nan,np.nan,np.nan,10,1,1,20,np.nan,10,0,1,20,1,1]})
print(desired_output_df)
id b c
0 1 3 NaN
1 1 4 NaN
2 1 5 NaN
3 1 A 10.0
4 1 3 1.0
5 1 4 1.0
6 1 A 20.0
7 2 1 NaN
8 2 A 10.0
9 2 1 0.0
10 2 3 1.0
11 2 A 20.0
12 2 2 1.0
13 2 3 1.0
I am able to get the index of the values of column c that I want to change using the following command: df.groupby('id').apply(lambda x: x.loc[:(x.b == 'A').idxmax()-1]).index. But the result is a "MultiIndex" and I can't seem to use it to replace the values.
MultiIndex([(1, 0),
(1, 1),
(1, 2),
(2, 7)],
names=['id', None])
Thanks in advance.
Try np.where: within each id group, the cumulative sum of b == 'A' is zero before the first 'A' and positive from it onward, which gives the keep/replace condition:
df['c'] = np.where(df.groupby('id').apply(lambda x: x['b'].eq('A').cumsum()) > 0, df['c'], np.nan)
print(df)
Prints:
id b c
0 1 3 NaN
1 1 4 NaN
2 1 5 NaN
3 1 A 10.0
4 1 3 1.0
5 1 4 1.0
6 1 A 20.0
7 2 1 NaN
8 2 A 10.0
9 2 1 0.0
10 2 3 1.0
11 2 A 20.0
12 2 2 1.0
13 2 3 1.0
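The same condition can also be built without apply, which sidesteps the index-alignment question entirely. A sketch of that equivalent approach: the per-group cumulative count of 'A's is positive exactly from the first 'A' onward.
seen_A = df['b'].eq('A').groupby(df['id']).cumsum()  # number of 'A's seen so far within each id
df['c'] = df['c'].where(seen_A.gt(0))                # keep c from the first 'A' on, NaN before it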
Let's say I have a DataFrame like:
import pandas as pd
df = pd.DataFrame({"Quarter": [1,2,3,4,1,2,3,4,4],
"Type": ["a","a","a","a","b","b","c","c","d"],
"Value": [4,1,3,4,7,2,9,4,1]})
Quarter Type Value
0 1 a 4
1 2 a 1
2 3 a 3
3 4 a 4
4 1 b 7
5 2 b 2
6 3 c 9
7 4 c 4
8 4 d 1
For each Type, there need to be four rows, one for each of the four quarters as indicated by the Quarter column. So, it would look like:
Quarter Type Value
0 1 a 4
1 2 a 1
2 3 a 3
3 4 a 4
4 1 b 7
5 2 b 2
6 3 b NaN
7 4 b NaN
8 1 c NaN
9 2 c NaN
10 3 c 9
11 4 c 4
12 1 d NaN
13 2 d NaN
14 3 d NaN
15 4 d 1
Then, where there are missing values in the Value column, fill each gap from the nearest available quarter of the same Type, preferring the previous quarter (so if the last quarter is missing, fill it from the third):
Quarter Type Value
0 1 a 4
1 2 a 1
2 3 a 3
3 4 a 4
4 1 b 7
5 2 b 2
6 3 b 2
7 4 b 2
8 1 c 9
9 2 c 9
10 3 c 9
11 4 c 4
12 1 d 1
13 2 d 1
14 3 d 1
15 4 d 1
What's the best way to accomplish this?
Use reindex:
idx = pd.MultiIndex.from_product([
df['Type'].unique(),
range(1,5)
], names=['Type', 'Quarter'])
df.set_index(['Type', 'Quarter']).reindex(idx) \
.groupby('Type') \
.transform(lambda v: v.ffill().bfill()) \
.reset_index()
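For readability, here is the same reindex idea as a sketch with the steps split out, using groupby ffill/bfill in place of the transform:
full = df.set_index(['Type', 'Quarter']).reindex(idx)  # missing quarters appear as NaN rows
full['Value'] = full.groupby('Type')['Value'].ffill()  # fill from the previous quarter first
full['Value'] = full.groupby('Type')['Value'].bfill()  # then fall back to the next quarter
out = full.reset_index()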
You can use set_index and unstack to create the missing rows you want (assuming each quarter is present for at least one Type), then ffill and bfill across the columns, and finally stack and reset_index to return to the original shape:
df = df.set_index(['Type', 'Quarter']).unstack()\
.ffill(axis=1).bfill(axis=1)\
.stack().reset_index()
print (df)
Type Quarter Value
0 a 1 4.0
1 a 2 1.0
2 a 3 3.0
3 a 4 4.0
4 b 1 7.0
5 b 2 2.0
6 b 3 2.0
7 b 4 2.0
8 c 1 9.0
9 c 2 9.0
10 c 3 9.0
11 c 4 4.0
12 d 1 1.0
13 d 2 1.0
14 d 3 1.0
15 d 4 1.0
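One edge case worth noting: if some quarter were missing from every Type, unstack would not create its column at all. A sketch that guards against that by reindexing the columns as well (starting again from the original df):
wide = df.set_index(['Type', 'Quarter'])['Value'].unstack()
wide = wide.reindex(columns=range(1, 5))   # guarantee all four quarter columns exist
wide = wide.ffill(axis=1).bfill(axis=1)
df_filled = wide.stack().rename('Value').reset_index()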
I have the following DataFrame:
>>> df = pd.DataFrame(data={
'type': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
'value': [0, 2, 3, 4, 0, 3, 2, 3, 0]})
>>> df
type value
0 A 0
1 A 2
2 A 3
3 B 4
4 B 0
5 B 3
6 C 2
7 C 3
8 C 0
What I need to accomplish is the following: for each type, keep a running count of the non-zero values, restarting from zero each time a 0 value is encountered (the 0 rows themselves get NaN):
type value cumcount
0 A 0 NaN
1 A 2 1
2 A 3 2
3 B 4 1
4 B 0 NaN
5 B 3 1
6 C 2 1
7 C 3 2
8 C 0 NaN
The idea is to create labels for the consecutive runs, filter out the 0 rows, and finally assign the cumulative count to a new column through the same filter:
m = df['value'].eq(0)
g = m.ne(m.shift()).cumsum()[~m]
df.loc[~m, 'new'] = df.groupby(['type',g]).cumcount().add(1)
print (df)
type value new
0 A 0 NaN
1 A 2 1.0
2 A 3 2.0
3 B 4 1.0
4 B 0 NaN
5 B 3 1.0
6 C 2 1.0
7 C 3 2.0
8 C 0 NaN
For pandas 0.24+ it is possible to use the nullable integer data type:
df['new'] = df['new'].astype('Int64')
print (df)
type value new
0 A 0 NaN
1 A 2 1
2 A 3 2
3 B 4 1
4 B 0 NaN
5 B 3 1
6 C 2 1
7 C 3 2
8 C 0 NaN
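To see how the helper series label the runs, a quick inspection sketch (same m and g as in the answer above):
m = df['value'].eq(0)           # True on the 0 rows
g = m.ne(m.shift()).cumsum()    # new label each time m flips, i.e. one label per consecutive run
print(df.assign(m=m, g=g))      # g comes out as 1 2 2 2 3 4 4 4 5 for this data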
I have a dataframe that consists of truthIds and trackIds:
truthId = ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'C', 'B', 'A', 'A', 'C', 'C']
trackId = [1, 1, 2, 2, 3, 4, 5, 3, 2, 1, 5, 4, 6]
df = pd.DataFrame({'truthId': truthId, 'trackId': trackId})
trackId truthId
0 1 A
1 1 A
2 2 B
3 2 B
4 3 C
5 4 C
6 5 A
7 3 C
8 2 B
9 1 A
10 5 A
11 4 C
12 6 C
I wish to add a column that calculates, for each unique truthId, the size of the set of unique trackIds that have previously (i.e. from the top of the data down to and including that row) been associated with it:
truthId trackId unique_Ids
0 A 1 1
1 A 1 1
2 B 2 1
3 B 2 1
4 C 3 1
5 C 4 2
6 A 5 2
7 C 3 2
8 B 2 1
9 A 1 2
10 A 5 2
11 C 4 2
12 C 6 3
I am very close to accomplishing this. I can use:
df.groupby('truthId').expanding().agg({'trackId': lambda x: len(set(x))})
Which produces the following output:
trackId
truthId
A 0 1.0
1 1.0
6 2.0
9 2.0
10 2.0
B 2 1.0
3 1.0
8 1.0
C 4 1.0
5 2.0
7 2.0
11 2.0
12 3.0
This is consistent with the documentation.
However, it throws an error when I attempt to assign this output to a new column:
df['unique_Ids'] = df.groupby('truthId').expanding().agg({'trackId': lambda x: len(set(x))})
I have used this workflow before, and ideally the new column is put back into the original DataFrame with no issues (i.e. Split-Apply-Combine). How can I get it to work?
You need reset_index: the expanding result carries truthId as the first level of a MultiIndex, so drop that level before assigning back:
df['Your'] = (df.groupby('truthId')
                .expanding()
                .agg({'trackId': lambda x: len(set(x))})
                .reset_index(level=0, drop=True))
df
Out[1162]:
trackId truthId Your
0 1 A 1.0
1 1 A 1.0
2 2 B 1.0
3 2 B 1.0
4 3 C 1.0
5 4 C 2.0
6 5 A 2.0
7 3 C 2.0
8 2 B 1.0
9 1 A 2.0
10 5 A 2.0
11 4 C 2.0
12 6 C 3.0
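An alternative that avoids expanding entirely: a row adds a new trackId exactly when its (truthId, trackId) pair has not appeared before, so the running unique count is the per-group cumulative sum of that flag. A sketch (not the answer above):
first_seen = ~df.duplicated(['truthId', 'trackId'])           # True on a pair's first appearance
df['unique_Ids'] = first_seen.groupby(df['truthId']).cumsum()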
I am trying to join (merge) two dataframes based on values in each column.
For instance, to merge on the values in columns A and B.
So, having df1
A B C D L
0 4 3 1 5 1
1 5 7 0 3 2
2 3 2 1 6 4
And df2
A B E F L
0 4 3 4 5 1
1 5 7 3 3 2
2 3 8 5 5 5
I want to get a df3 with the following structure:
A B C D E F L
0 4 3 1 5 4 5 1
1 5 7 0 3 3 3 2
2 3 2 1 6 NaN NaN 4
3 3 8 NaN NaN 5 5 5
Can you please help me? I've tried both the merge and join methods but haven't succeeded.
UPDATE: (for updated DFs and new desired DF)
In [286]: merged = pd.merge(df1, df2, on=['A','B'], how='outer', suffixes=('','_y'))
In [287]: merged.L.fillna(merged.pop('L_y'), inplace=True)
In [288]: merged
Out[288]:
A B C D L E F
0 4 3 1.0 5.0 1.0 4.0 5.0
1 5 7 0.0 3.0 2.0 3.0 3.0
2 3 2 1.0 6.0 4.0 NaN NaN
3 3 8 NaN NaN 5.0 5.0 5.0
Data:
In [284]: df1
Out[284]:
A B C D L
0 4 3 1 5 1
1 5 7 0 3 2
2 3 2 1 6 4
In [285]: df2
Out[285]:
A B E F L
0 4 3 4 5 1
1 5 7 3 3 2
2 3 8 5 5 5
OLD answer:
You can use the pd.merge(..., how='outer') method:
In [193]: pd.merge(a,b, on=['A','B'], how='outer')
Out[193]:
A B C D E F
0 4 3 1.0 5.0 4.0 5.0
1 5 7 0.0 3.0 3.0 3.0
2 3 2 1.0 6.0 NaN NaN
3 3 8 NaN NaN 5.0 5.0
Data:
In [194]: a
Out[194]:
A B C D
0 4 3 1 5
1 5 7 0 3
2 3 2 1 6
In [195]: b
Out[195]:
A B E F
0 4 3 4 5
1 5 7 3 3
2 3 8 5 5
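For reference, the combine step from the UPDATE can also be written without inplace, which recent pandas versions prefer; a sketch assuming a current pandas:
merged = pd.merge(df1, df2, on=['A', 'B'], how='outer', suffixes=('', '_y'))
merged['L'] = merged['L'].fillna(merged.pop('L_y'))  # take L from df1 where present, else from df2
merged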