I have a dataframe that consists of truthIds and trackIds:
truthId = ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'C', 'B', 'A', 'A', 'C', 'C']
trackId = [1, 1, 2, 2, 3, 4, 5, 3, 2, 1, 5, 4, 6]
df = pd.DataFrame({'truthId': truthId, 'trackId': trackId})
trackId truthId
0 1 A
1 1 A
2 2 B
3 2 B
4 3 C
5 4 C
6 5 A
7 3 C
8 2 B
9 1 A
10 5 A
11 4 C
12 6 C
I wish to add a column that counts, for each truthId, the number of unique trackIds that have been associated with it so far (i.e. from the top of the data down to and including that row):
truthId trackId unique_Ids
0 A 1 1
1 A 1 1
2 B 2 1
3 B 2 1
4 C 3 1
5 C 4 2
6 A 5 2
7 C 3 2
8 B 2 1
9 A 1 2
10 A 5 2
11 C 4 2
12 C 6 3
I am very close to accomplishing this. I can use:
df.groupby('truthId').expanding().agg({'trackId': lambda x: len(set(x))})
Which produces the following output:
trackId
truthId
A 0 1.0
1 1.0
6 2.0
9 2.0
10 2.0
B 2 1.0
3 1.0
8 1.0
C 4 1.0
5 2.0
7 2.0
11 2.0
12 3.0
This is consistent with the documentation.
However, it throws an error when I attempt to assign this output to a new column:
df['unique_Ids'] = df.groupby('truthId').expanding().agg({'trackId': lambda x: len(set(x))})
I have used this workflow before, and ideally the new column is put back into the original DataFrame with no issues (i.e. split-apply-combine). How can I get it to work?
You need reset_index to drop the truthId level, so that the result lines up with the original index:
df['Your']=(df.groupby('truthId').expanding().agg({'trackId': lambda x: len(set(x))})).reset_index(level=0,drop=True)
df
Out[1162]:
trackId truthId Your
0 1 A 1.0
1 1 A 1.0
2 2 B 1.0
3 2 B 1.0
4 3 C 1.0
5 4 C 2.0
6 5 A 2.0
7 3 C 2.0
8 2 B 1.0
9 1 A 2.0
10 5 A 2.0
11 4 C 2.0
12 6 C 3.0
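As an alternative to the expanding lambda, the same running count can be sketched without any Python-level function: flag the first time each (truthId, trackId) pair appears, then take a cumulative sum of those flags within each truthId. This is only a sketch on the question's data; the name first_seen is made up for illustration.

```python
import pandas as pd

truthId = ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'C', 'B', 'A', 'A', 'C', 'C']
trackId = [1, 1, 2, 2, 3, 4, 5, 3, 2, 1, 5, 4, 6]
df = pd.DataFrame({'truthId': truthId, 'trackId': trackId})

# 1 where a (truthId, trackId) pair appears for the first time, else 0
first_seen = (~df.duplicated(['truthId', 'trackId'])).astype(int)

# Running total of first appearances within each truthId.
# The result keeps the original index, so no reset_index is needed.
df['unique_Ids'] = first_seen.groupby(df['truthId']).cumsum()

print(df)
```

Because the result is indexed like df, it can be assigned directly, and the column stays integer rather than float.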
Below is a toy Pandas dataframe that has three columns: 'id' (group id), 'b' (for condition), and 'c' (target):
df = pd.DataFrame({'id' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
'b' : [3,4,5,'A',3,4,'A',1,'A',1,3,'A',2,3],
'c' : [1,0,1,10,1,1,20,1,10,0,1,20,1,1]})
print(df)
id b c
0 1 3 1
1 1 4 0
2 1 5 1
3 1 A 10
4 1 3 1
5 1 4 1
6 1 A 20
7 2 1 1
8 2 A 10
9 2 1 0
10 2 3 1
11 2 A 20
12 2 2 1
13 2 3 1
For each group, I want to replace the values in column 'c' with nan (i.e., np.nan) before the first occurrence of 'A' in column 'b'.
The desired output is the following:
desired_output_df = pd.DataFrame({'id' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
'b' : [3,4,5,'A',3,4,'A',1,'A',1,3,'A',2,3],
'c' : [np.nan,np.nan,np.nan,10,1,1,20,np.nan,10,0,1,20,1,1]})
print(desired_output_df)
id b c
0 1 3 NaN
1 1 4 NaN
2 1 5 NaN
3 1 A 10.0
4 1 3 1.0
5 1 4 1.0
6 1 A 20.0
7 2 1 NaN
8 2 A 10.0
9 2 1 0.0
10 2 3 1.0
11 2 A 20.0
12 2 2 1.0
13 2 3 1.0
I am able to get the index of the values of column c that I want to change using the following command: df.groupby('id').apply(lambda x: x.loc[:(x.b == 'A').idxmax()-1]).index. But the result is a "MultiIndex" and I can't seem to use it to replace the values.
MultiIndex([(1, 0),
(1, 1),
(1, 2),
(2, 7)],
names=['id', None])
Thanks in advance.
Try:
df['c'] = np.where(df.groupby('id').apply(lambda x: x['b'].eq('A').cumsum()) > 0, df['c'], np.nan)
print(df)
Prints:
id b c
0 1 3 NaN
1 1 4 NaN
2 1 5 NaN
3 1 A 10.0
4 1 3 1.0
5 1 4 1.0
6 1 A 20.0
7 2 1 NaN
8 2 A 10.0
9 2 1 0.0
10 2 3 1.0
11 2 A 20.0
12 2 2 1.0
13 2 3 1.0
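The np.where call above depends on the apply result coming back in the same row order as df. A variant that stays aligned on the original index can be sketched by grouping the boolean mask itself; seen_a is a name made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
                   'b': [3, 4, 5, 'A', 3, 4, 'A', 1, 'A', 1, 3, 'A', 2, 3],
                   'c': [1, 0, 1, 10, 1, 1, 20, 1, 10, 0, 1, 20, 1, 1]})

# Per-group running count of 'A' rows; > 0 means an 'A' has been seen
# at or before this row within the same id.
seen_a = df['b'].eq('A').astype(int).groupby(df['id']).cumsum().gt(0)

# Keep c where an 'A' has been seen, otherwise replace it with NaN.
df['c'] = df['c'].where(seen_a)

print(df)
```

Since seen_a carries the same index as df, this should also work when the rows of different groups are interleaved.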
Given the following dataframe
df = pd.DataFrame(data={'name': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c'],
'lag': [1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
'value': range(10)})
print(df)
lag name value
0 1 a 0
1 1 a 1
2 1 a 2
3 2 b 3
4 2 b 4
5 2 b 5
6 2 b 6
7 2 c 7
8 2 c 8
9 2 c 9
I am trying to shift the values in column value to obtain the column expected_value: within each name group, the values are shifted by that group's lag rows. I was thinking of using something like df['expected_value'] = df.groupby(['name', 'lag']).shift(), but I am not sure how to pass lag to the shift() function.
print(df)
lag name value expected_value
0 1 a 0 nan
1 1 a 1 0.0000
2 1 a 2 1.0000
3 2 b 3 nan
4 2 b 4 nan
5 2 b 5 3.0000
6 2 b 6 4.0000
7 2 c 7 nan
8 2 c 8 nan
9 2 c 9 7.0000
You can use GroupBy.transform here. Inside the lambda, x.name holds the group key (name, lag), so x.name[1] is the lag:
df.assign(expected_value = df.groupby(['name', 'lag'])['value'].
transform(lambda x: x.shift(x.name[1])))
name lag value expected_value
0 a 1 0 NaN
1 a 1 1 0.0
2 a 1 2 1.0
3 b 2 3 NaN
4 b 2 4 NaN
5 b 2 5 3.0
6 b 2 6 4.0
7 c 2 7 NaN
8 c 2 8 NaN
9 c 2 9 7.0
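If relying on x.name feels too implicit, the same idea can be sketched with an explicit loop over the groups, where the lag is unpacked directly from the group key. This is an illustrative variant and not necessarily faster; the name pieces is made up here.

```python
import pandas as pd

df = pd.DataFrame(data={'name': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c'],
                        'lag': [1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
                        'value': range(10)})

# Iterating over the grouped column yields ((name, lag), Series) pairs,
# so each group's values can be shifted by its own lag.
pieces = [grp.shift(int(lag))
          for (name, lag), grp in df.groupby(['name', 'lag'])['value']]

# concat keeps the original row labels, so the assignment aligns on the index.
df['expected_value'] = pd.concat(pieces)

print(df)
```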
You can also do it with apply:
df['new_val'] = (df.groupby('name')
.apply(lambda x: x['value'].shift(x['lag'].iloc[0]))
.reset_index('name',drop=True)
)
Output:
name lag value new_val
0 a 1 0 NaN
1 a 1 1 0.0
2 a 1 2 1.0
3 b 2 3 NaN
4 b 2 4 NaN
5 b 2 5 3.0
6 b 2 6 4.0
7 c 2 7 NaN
8 c 2 8 NaN
9 c 2 9 7.0
I have the following DataFrame:
>>> df = pd.DataFrame(data={
'type': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
'value': [0, 2, 3, 4, 0, 3, 2, 3, 0]})
>>> df
type value
0 A 0
1 A 2
2 A 3
3 B 4
4 B 0
5 B 3
6 C 2
7 C 3
8 C 0
What I need to accomplish is the following: for each type, keep a running count of non-zero values that restarts each time a 0 value is encountered.
type value cumcount
0 A 0 NaN
1 A 2 1
2 A 3 2
3 B 4 1
4 B 0 NaN
5 B 3 1
6 C 2 1
7 C 3 2
8 C 0 NaN
The idea is to label runs of consecutive zero/non-zero values, keep only the labels of the non-zero rows, and then assign the per-run cumulative count back to those rows:
m = df['value'].eq(0)
g = m.ne(m.shift()).cumsum()[~m]
df.loc[~m, 'new'] = df.groupby(['type',g]).cumcount().add(1)
print (df)
type value new
0 A 0 NaN
1 A 2 1.0
2 A 3 2.0
3 B 4 1.0
4 B 0 NaN
5 B 3 1.0
6 C 2 1.0
7 C 3 2.0
8 C 0 NaN
With pandas 0.24+ you can use the nullable integer data type:
df['new'] = df['new'].astype('Int64')
print (df)
type value new
0 A 0 NaN
1 A 2 1
2 A 3 2
3 B 4 1
4 B 0 NaN
5 B 3 1
6 C 2 1
7 C 3 2
8 C 0 NaN
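To make the intermediate steps easier to follow, the same technique can be sketched with named pieces. Here the run labels are kept for every row and the zero rows are masked at the end, which is a small variation on the answer above; is_zero, run_id and counts are names made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(data={
    'type': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'value': [0, 2, 3, 4, 0, 3, 2, 3, 0]})

# True on the rows that should reset the count
is_zero = df['value'].eq(0)

# A new label starts whenever is_zero flips, so each run of
# consecutive zero or non-zero rows gets its own id.
run_id = is_zero.ne(is_zero.shift()).cumsum()

# Position within each (type, run) group, starting at 1
counts = df.groupby(['type', run_id]).cumcount().add(1)

# Blank out the zero rows and store as nullable integers
df['cumcount'] = counts.where(~is_zero).astype('Int64')

print(df)
```

Grouping by type as well as run_id matters for rows 2 and 3: they belong to the same non-zero run but to different types, so the count restarts at row 3.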
I have a DataFrame like the one below:
A B C D E F G H G H I J K
1 2 3 4 5 6 7 8 9 10 11 12 13
and I want a result like this:
A B C D E F G H
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 9
1 2 3 4 5 6 7 10
1 2 3 4 5 6 7 11
1 2 3 4 5 6 7 12
1 2 3 4 5 6 7 13
That is, in the result the values of the trailing columns (the second 'G' and 'H' through 'K') are stacked under column 'H'.
How can I do this?
You need to adjust your column names first: use cummax to rename every column from 'H' onward to 'H'. After melt, create an additional key with cumcount and then reshape. I am using unstack here, but you could also use pivot or pivot_table.
s=pd.Series(df.columns)
s[(s>='H').cummax()==1]='H'
df.columns=s
df=df.melt()
yourdf=df.set_index(['variable',df.groupby('variable').cumcount()]).\
value.unstack(0).ffill()
yourdf
variable A B C D E F G H
0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0
1 1.0 2.0 3.0 4.0 5.0 6.0 7.0 9.0
2 1.0 2.0 3.0 4.0 5.0 6.0 7.0 10.0
3 1.0 2.0 3.0 4.0 5.0 6.0 7.0 11.0
4 1.0 2.0 3.0 4.0 5.0 6.0 7.0 12.0
5 1.0 2.0 3.0 4.0 5.0 6.0 7.0 13.0
I hope this helps:
import pandas as pd
df = pd.DataFrame([list(range(1,14))])
df.columns = ('A','B','C','D','E','F','G','H','G','H','I','J','K')
print('starting data frame:')
print(df)
df1 = df.iloc[:,0:7]
df1 = pd.concat([df1] * df.iloc[:, 7:].shape[1])  # DataFrame.append was removed in pandas 2.0
df1.insert(df1.shape[1],'H',list(df.iloc[:,7:].values[0]))
print('result:')
print(df1)
import numpy as np
import pandas as pd

letters = list("ABCDEFGHIJKLM")
df = pd.DataFrame([np.arange(1, len(letters) + 1)], columns=letters)
df = pd.concat([df.iloc[:, :7]] * (len(letters) - 7)).assign(H=df[letters[7:]].values[0])
df = df.reset_index(drop=True)
df
gives you
A B C D E F G H
0 1 2 3 4 5 6 7 8
1 1 2 3 4 5 6 7 9
2 1 2 3 4 5 6 7 10
3 1 2 3 4 5 6 7 11
4 1 2 3 4 5 6 7 12
5 1 2 3 4 5 6 7 13
Your data has duplicate column names, so melt will fail. However, you can rename the columns and then apply melt.
In [166]: df
Out[166]:
A B C D E F G H G H I J K
0 1 2 3 4 5 6 7 8 9 10 11 12 13
The column names 'G' and 'H' are duplicated. Just change the second occurrences to 'GG' and 'HH', then apply melt:
In [167]: df.columns = ('A','B','C','D','E','F','G','H','GG','HH','I','J','K')
In [168]: df
Out[168]:
A B C D E F G H GG HH I J K
0 1 2 3 4 5 6 7 8 9 10 11 12 13
In [169]: df.melt(id_vars=df.columns.tolist()[0:7], value_name='H').drop(columns='variable')
Out[169]:
A B C D E F G H
0 1 2 3 4 5 6 7 8
1 1 2 3 4 5 6 7 9
2 1 2 3 4 5 6 7 10
3 1 2 3 4 5 6 7 11
4 1 2 3 4 5 6 7 12
5 1 2 3 4 5 6 7 13
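Since the duplicate labels are what makes melt awkward, another option is to avoid labels altogether and work by position. The following is a minimal sketch, assuming the first seven columns are always the fixed part and everything after them goes under 'H'; fixed, tail and out are names made up for illustration.

```python
import numpy as np
import pandas as pd

# Same single-row frame as in the question, including the duplicate labels
df = pd.DataFrame([list(range(1, 14))], columns=list('ABCDEFGHGHIJK'))

# Columns A..G by position (their labels are unique)
fixed = df.iloc[:, :7]

# Values of the remaining columns in the first row, selected by position
tail = df.iloc[0, 7:].to_numpy()

# Repeat the fixed part once per trailing value, then attach the values as H
out = pd.DataFrame(np.repeat(fixed.to_numpy(), len(tail), axis=0),
                   columns=fixed.columns)
out['H'] = tail

print(out)
```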
As an extension of my previous question, I would like to take a DataFrame like the one below and, for each row, find the correct row from which to pull the value of column C and place it into column D, based on the following criteria:
B_new = 2*A_old - B_old, i.e. the new row needs to have a B equal to 2*A - B computed from the old row.
A is the same, i.e. the new row must have the same value of A as the old row.
Any values not found should give a NaN result.
Code:
import pandas as pd
a = [2,2,2,3,3,3,3]
b = [1,2,3,1,3,4,5]
c = [0,1,2,3,4,5,6]
df = pd.DataFrame({'A': a , 'B': b, 'C':c})
print(df)
A B C
0 2 1 0
1 2 2 1
2 2 3 2
3 3 1 3
4 3 3 4
5 3 4 5
6 3 5 6
Desired output:
A B C D
0 2 1 0 2.0
1 2 2 1 1.0
2 2 3 2 0.0
3 3 1 3 6.0
4 3 3 4 4.0
5 3 4 5 NaN
6 3 5 6 3.0
Based upon the solutions in my previous question, I've come up with a method that uses a for loop to move through each unique value of A:
for i in df.A.unique():
mapping = dict(df[df.A==i][['B', 'C']].values)
df.loc[df.A==i,'D'] = (2 * df[df.A==i]['A'] - df[df.A==i]['B']).map(mapping)
However, this seems clunky, and I suspect there is a better way that doesn't make use of for loops, which from my prior experience tend to be slow.
Question:
What's the fastest way to accomplish this transfer of data within the DataFrame?
You could compute the target B and left-merge the frame with itself on ['A', 'B']:
In [370]: (df[['A', 'C']].assign(B=2*df.A - df.B)
.merge(df, how='left', on=['A', 'B'])
.assign(B=df.B)
.rename(columns={'C_x': 'C', 'C_y': 'D'}) )
Out[370]:
A C B D
0 2 0 1 2.0
1 2 1 2 1.0
2 2 2 3 0.0
3 3 3 1 6.0
4 3 4 3 4.0
5 3 5 4 NaN
6 3 6 5 3.0
Details:
In [372]: df[['A', 'C']].assign(B=2*df.A - df.B)
Out[372]:
A C B
0 2 0 3
1 2 1 2
2 2 2 1
3 3 3 5
4 3 4 3
5 3 5 2
6 3 6 1
In [373]: df[['A', 'C']].assign(B=2*df.A - df.B).merge(df, how='left', on=['A', 'B'])
Out[373]:
A C_x B C_y
0 2 0 3 2.0
1 2 1 2 1.0
2 2 2 1 0.0
3 3 3 5 6.0
4 3 4 3 4.0
5 3 5 2 NaN
6 3 6 1 3.0
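The merge steps above can also be sketched so that D is written straight back into the original frame, keeping the A, B, C column order. This assumes each (A, B) pair occurs at most once, as in the sample data, so the left merge returns exactly one row per input row; lookup and keys are names made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 2, 2, 3, 3, 3, 3],
                   'B': [1, 2, 3, 1, 3, 4, 5],
                   'C': [0, 1, 2, 3, 4, 5, 6]})

# Lookup table: the value of C for each (A, B), exposed under the name D
lookup = df.rename(columns={'C': 'D'})

# For every row, the (A, B) pair whose C value should be pulled in
keys = df[['A']].assign(B=2 * df['A'] - df['B'])

# A left merge keeps the row order of keys; unmatched pairs give NaN.
df['D'] = keys.merge(lookup, how='left', on=['A', 'B'])['D'].to_numpy()

print(df)
```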