Question: How do you group a df based on a variable and make a computation using a for loop?
The task is to make a conditional computation based on the value in a reference column, where the computational constants depend on that value. Given this df:
In [55]: df = pd.DataFrame({
...: 'col1' : ['A', 'A', 'B', np.nan, 'D', 'C'],
...: 'col2' : [2, 1, 9, 8, 7, 4],
...: 'col3': [0, 1, 9, 4, 2, 3],
...: })
In [56]: df
Out[56]:
col1 col2 col3
0 A 2 0
1 A 1 1
2 B 9 9
3 NaN 8 4
4 D 7 2
5 C 4 3
I've used the solution here to insert a 'math' column that takes the balance from col3 and multiplies it by 10. But now I want to iterate over a list to set the computational variable depending on the values in col1. Here's the result:
In [57]: items = ['A', 'D']
In [58]: for item in items:
...: df.loc[:, 'math'] = df.loc[df['col1'] == item, 'col3']
...:
In [59]: df
Out[59]:
col1 col2 col3 math
0 A 2 0 NaN
1 A 1 1 NaN
2 B 9 9 NaN
3 NaN 8 4 NaN
4 D 7 2 2.0
5 C 4 3 NaN
The obvious issue is that the math column is overwritten on each iteration. Indexes 0 and 1 received computed values on the first iteration, but those were wiped out on the second. The resulting df only reflects the last element of the list.
I could go through and add code to iterate over each index value - but that seems more pathetic than pythonic.
Expected Output for the .mul() example
In [100]: df
Out[100]:
col1 col2 col3 math
0 A 2 0 0.0
1 A 1 1 10.0
2 B 9 9 NaN
3 NaN 8 4 NaN
4 D 7 2 20.0
5 C 4 3 NaN
The problem with your current method is the output of each subsequent iteration overwrites the output of the one before it. So you'd end up with output for just the last item and nothing more.
Select all rows with elements in items and assign, same as you did before.
df['math'] = df.loc[df.col1.isin(items), 'col3'] * 10
Or,
df['math'] = df.query("col1 in @items").col3 * 10
Or even,
df['math'] = df.col3.where(df.col1.isin(items)) * 10
df
col1 col2 col3 math
0 A 2 0 0.0
1 A 1 1 10.0
2 B 9 9 NaN
3 NaN 8 4 NaN
4 D 7 2 20.0
5 C 4 3 NaN
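If each item needs its own constant rather than a single `* 10`, one alternative is Series.map with a dict. This is a sketch; the factors below are made up for illustration, not taken from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
    'col2': [2, 1, 9, 8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
})

# Hypothetical per-item constants -- substitute your real values.
factors = {'A': 10, 'D': 20}

# Rows whose col1 is not a key in `factors` map to NaN, so 'math'
# stays NaN for B, C and the missing value, just like the isin answers.
df['math'] = df['col3'] * df['col1'].map(factors)
```

This avoids the loop entirely: the dict lookup replaces the per-item iteration.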
The reason your loop fails is that each iteration assigns 'math' a brand-new value, as shown below, so after the loop only the result of the last iteration remains:
0 0.0
1 10.0
2 NaN
3 NaN
4 NaN
5 NaN
Name: col3, dtype: float64
0 NaN
1 NaN
2 NaN
3 NaN
4 20.0
5 NaN
Name: col3, dtype: float64
You can do it like this:
df.loc[df.col1.isin(items),'math']=df.col3*10
df
Out[85]:
col1 col2 col3 math
0 A 2 0 0.0
1 A 1 1 10.0
2 B 9 9 NaN
3 NaN 8 4 NaN
4 D 7 2 20.0
5 C 4 3 NaN
Related
Assume, I have a data frame such as
import pandas as pd
df = pd.DataFrame({'visitor':['A','B','C','D','E'],
'col1':[1,2,3,4,5],
'col2':[1,2,4,7,8],
'col3':[4,2,3,6,1]})
visitor  col1  col2  col3
A        1     1     4
B        2     2     2
C        3     4     3
D        4     7     6
E        5     8     1
For each row/visitor: (1) First, if there are any identical values, I would like to keep the first value in each row and replace the remaining identical values in the same row with NULL, such as
visitor  col1  col2  col3
A        1     NULL  4
B        2     NULL  NULL
C        3     4     NULL
D        4     7     6
E        5     8     1
Then (2) keep only the rows/visitors with more than one remaining value, such as
Final Data Frame
visitor  col1  col2  col3
A        1     NULL  4
C        3     4     NULL
D        4     7     6
E        5     8     1
Any suggestions? many thanks
We can use Series.duplicated along the columns axis to identify the duplicates, then mask the duplicates using where and keep the rows where the number of non-duplicated values is greater than 1:
s = df.set_index('visitor')
m = ~s.apply(pd.Series.duplicated, axis=1)
s.where(m)[m.sum(1).gt(1)]
col1 col2 col3
visitor
A 1 NaN 4.0
C 3 4.0 NaN
D 4 7.0 6.0
E 5 8.0 1.0
Let us try mask with pd.Series.duplicated, then dropna with thresh:
out = df.mask(df.apply(pd.Series.duplicated, axis=1)).dropna(thresh=df.shape[1] - 1)
Out[321]:
visitor col1 col2 col3
0 A 1 NaN 4.0
2 C 3 4.0 NaN
3 D 4 7.0 6.0
4 E 5 8.0 1.0
I have a scenario where I have an existing dataframe and I have a new dataframe which contains rows which might be in the existing frame but might also have new rows. I have struggled to find a reliable way to drop these existing rows from the new dataframe by comparing it with the existing dataframe.
I've done my homework. The solution seems to be to use isin(). However, I find that this has hidden dangers. In particular:
pandas get rows which are NOT in other dataframe
Pandas cannot compute isin with a duplicate axis
Pandas promotes int to float when filtering
Is there a way to reliably filter out rows from one dataframe based on membership/containment in another dataframe? A simple use case which doesn't capture corner cases is shown below. Note that I want to remove rows in new that are in existing, so that new only contains rows not in existing. A simpler problem, updating existing with new rows from new, can be achieved with pd.merge() + DataFrame.drop_duplicates().
In [53]: df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12, 13, 14]})
...: df2 = pd.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]})
In [54]: df1
Out[54]:
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
In [55]: df2
Out[55]:
col1 col2
0 1 10
1 2 11
2 3 12
In [56]: df1[~df1.isin(df2)]
Out[56]:
col1 col2
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 4.0 13.0
4 5.0 14.0
In [57]: df1[~df1.isin(df2)].dropna()
Out[57]:
col1 col2
3 4.0 13.0
4 5.0 14.0
We can use DataFrame.merge with indicator=True, followed by DataFrame.query and DataFrame.drop:
df_filtered=( df1.merge(df2,how='outer',indicator=True)
.query("_merge == 'left_only'")
.drop('_merge',axis=1) )
print(df_filtered)
col1 col2
3 4 13
4 5 14
If, for example, we now change a value in row 0:
df1.iat[0, 0] = 3
row 0 is no longer filtered out:
df_filtered=( df1.merge(df2,how='outer',indicator=True)
.query("_merge == 'left_only'")
.drop('_merge',axis=1) )
print(df_filtered)
col1 col2
0 3 10
3 4 13
4 5 14
Step by step
df_filtered=( df1.merge(df2,how='outer',indicator=True)
)
print(df_filtered)
col1 col2 _merge
0 3 10 left_only
1 2 11 both
2 3 12 both
3 4 13 left_only
4 5 14 left_only
5 1 10 right_only
df_filtered=( df1.merge(df2,how='outer',indicator=True).query("_merge == 'left_only'")
)
print(df_filtered)
col1 col2 _merge
0 3 10 left_only
3 4 13 left_only
4 5 14 left_only
df_filtered=( df1.merge(df2,how='outer',indicator=True)
.query("_merge == 'left_only'")
.drop('_merge',axis=1)
)
print(df_filtered)
col1 col2
0 3 10
3 4 13
4 5 14
You may try Series.isin. It is independent of the index, i.e. it only checks values. You just need to convert the columns of each dataframe to a Series of tuples to create the mask:
s1 = df1.agg(tuple, axis=1)
s2 = df2.agg(tuple, axis=1)
df1[~s1.isin(s2)]
Out[538]:
col1 col2
3 4 13
4 5 14
I want the expanding mean of col2 within groupby('col1'), but the mean should not include the row itself (just the rows above it).
dummy = pd.DataFrame({"col1": ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c'], "col2": [1, 2, 3, 4, 5, 6, 7, 8]}, index=list(range(8)))
print(dummy)
dummy['one_liner'] = dummy.groupby('col1').col2.shift().expanding().mean().reset_index(level=0, drop=True)
dummy['two_liner'] = dummy.groupby('col1').col2.shift()
dummy['two_liner'] = dummy.groupby('col1').two_liner.expanding().mean().reset_index(level=0, drop=True)
print(dummy)
---------------------------
here is result of first print statement:
col1 col2
0 a 1
1 a 2
2 a 3
3 b 4
4 b 5
5 b 6
6 c 7
7 c 8
here is result of the second print:
col1 col2 one_liner two_liner
0 a 1 NaN NaN
1 a 2 1.000000 1.0
2 a 3 1.500000 1.5
3 b 4 1.500000 NaN
4 b 5 2.333333 4.0
5 b 6 3.000000 4.5
6 c 7 3.000000 NaN
7 c 8 3.800000 7.0
I would have thought their results would be identical.
two_liner is the expected result. one_liner mixes numbers in between groups.
It took a long time to figure out this solution, can anyone explain the logic? Why does one_liner not give expected results?
You are looking for expanding().mean() and shift() within the groupby():
groups = df.groupby('col1')
df['one_liner'] = groups.col2.apply(lambda x: x.expanding().mean().shift())
df['two_liner'] = groups.one_liner.apply(lambda x: x.expanding().mean().shift())
Output:
col1 col2 one_liner two_liner
0 a 1 NaN NaN
1 a 2 1.0 NaN
2 a 3 1.5 1.0
3 b 4 NaN NaN
4 b 5 4.0 NaN
5 b 6 4.5 4.0
6 c 7 NaN NaN
7 c 8 7.0 NaN
Explanation:
(dummy.groupby('col1').col2.shift()  # this shifts col2 within the groups
 .expanding().mean()                 # this ignores the grouping and expands over the whole series
 .reset_index(level=0, drop=True)    # this is not really important
)
So that the above chained command is equivalent to
s1 = dummy.groupby('col1').col2.shift()
s2 = s1.expanding().mean()
s3 = s2.reset_index(level=0, drop=True)
As you can see, only s1 considers the grouping by col1.
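As a quick check, recreating the dummy frame (with col2 numeric) shows exactly where the two diverge: the ungrouped expanding window leaks the shifted 'a' values into the 'b' rows.

```python
import pandas as pd

dummy = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c'],
                      'col2': [1, 2, 3, 4, 5, 6, 7, 8]})

# grouped shift, then UNGROUPED expanding mean: at row 3 (first 'b' row)
# the window still contains the shifted 'a' values, hence 1.5 instead of NaN
one = dummy.groupby('col1')['col2'].shift().expanding().mean()

# fully grouped version: shift and expanding mean both stay inside each group
two = (dummy.groupby('col1')['col2']
       .apply(lambda x: x.expanding().mean().shift())
       .reset_index(level=0, drop=True))
```

At position 3, `one` is 1.5 (mean of the shifted 'a' values) while `two` is NaN, which is the behaviour the question expected.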
I have a pandas DataFrame that looks similar to the following...
>>> df = pd.DataFrame({
... 'col1':['A','C','B','A','B','C','A'],
... 'col2':[np.nan,1.,np.nan,1.,1.,np.nan,np.nan],
... 'col3':[0,1,9,4,2,3,5],
... })
>>> df
col1 col2 col3
0 A NaN 0
1 C 1.0 1
2 B NaN 9
3 A 1.0 4
4 B 1.0 2
5 C NaN 3
6 A NaN 5
What I would like to do is group the rows of col1 by value and then update any NaN values in col2 to increment in value by 1 based on the last highest value of that group in col1.
So that my expected results would look like the following...
>>> df
col1 col2 col3
0 A 1.0 4
1 A 2.0 0
2 A 3.0 5
3 B 1.0 2
4 B 2.0 9
5 C 1.0 1
6 C 2.0 3
I believe I can use something like groupby on col1, though I'm unsure how to increment the value in col2 based on the last highest value of the group. I've tried the following, but instead of incrementing the values it sets them all to 1.0 and adds an additional column...
>>> df1 = df.groupby(['col1'], as_index=False).agg({'col2': 'min'})
>>> df = pd.merge(df1, df, how='left', left_on=['col1'], right_on=['col1'])
>>> df
col1 col2_x col2_y col3
0 A 1.0 NaN 0
1 A 1.0 1.0 1
2 A 1.0 NaN 5
3 B 1.0 NaN 9
4 B 1.0 1.0 4
5 C 1.0 1.0 2
6 C 1.0 NaN 3
Use GroupBy.cumcount only for the rows with missing values, add the per-group maximum obtained with GroupBy.transform('max'), and finally restore the original values with fillna:
df = pd.DataFrame({
'col1':['A','C','B','A','B','B','B'],
'col2':[np.nan,1.,np.nan,1.,3.,np.nan, 0],
'col3':[0,1,9,4,2,3,4],
})
print (df)
col1 col2 col3
0 A NaN 0
1 C 1.0 1
2 B NaN 9
3 A 1.0 4
4 B 3.0 2
5 B NaN 3
6 B 0.0 4
df = df.sort_values(['col1','col2'], na_position='last')
s = df.groupby('col1')['col2'].transform('max')
df['new'] = (df[df['col2'].isna()]
.groupby('col1')
.cumcount()
.add(1)
.add(s)
.fillna(df['col2']).astype(int))
print (df)
col1 col2 col3 new
3 A 1.0 4 1
0 A NaN 0 2
6 B 0.0 4 0
4 B 3.0 2 3
2 B NaN 9 4
5 B NaN 3 5
1 C 1.0 1 1
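Applied to the original frame from the question (the example above uses a modified frame), the same recipe reproduces the expected output:

```python
import numpy as np
import pandas as pd

# the original question's frame
df = pd.DataFrame({'col1': ['A', 'C', 'B', 'A', 'B', 'C', 'A'],
                   'col2': [np.nan, 1., np.nan, 1., 1., np.nan, np.nan],
                   'col3': [0, 1, 9, 4, 2, 3, 5]})

df = df.sort_values(['col1', 'col2'], na_position='last')
grp_max = df.groupby('col1')['col2'].transform('max')

# number the NaN rows within each group, offset by the group maximum,
# then restore the original values for the non-NaN rows
df['col2'] = (df[df['col2'].isna()].groupby('col1').cumcount()
              .add(1).add(grp_max).fillna(df['col2']))
```

Group A ends up with 1, 2, 3; groups B and C with 1, 2 - matching the expected result in the question.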
Another way:
df['col2_new'] = df.groupby('col1')['col2'].apply(lambda x: x.replace(np.nan, x.value_counts().index[0]+1))
df = df.sort_values('col1')
This is probably a stupid question, but I have been trying for a while and I can't seem to get it to work.
I have a dataframe:
df1 = pd.DataFrame({'Type': ['A','A', 'B', 'F', 'C', 'G', 'A', 'E'], 'Other': [999., 999., 999., 999., 999., 999., 999., 999.]})
I now want to create a new column based on the column Type. For this I have second dataframe:
df2 = pd.DataFrame({'Type':['A','B','C','D','E','F', 'G'],'Value':[1, 1, 2, 3, 4, 4, 5]})
that I am using as a lookup table.
When I try something like:
df1.apply(lambda x: df2.Value[df2.Type == x['Type']],axis=1)
I get a matrix instead of a single column:
Out[21]:
0 1 2 4 5 6
0 1 NaN NaN NaN NaN NaN
1 1 NaN NaN NaN NaN NaN
2 NaN 1 NaN NaN NaN NaN
3 NaN NaN NaN NaN 4 NaN
4 NaN NaN 2 NaN NaN NaN
5 NaN NaN NaN NaN NaN 5
6 1 NaN NaN NaN NaN NaN
7 NaN NaN NaN 4 NaN NaN
What I want however is this:
0
0 1
1 1
2 1
3 4
4 2
5 5
6 1
7 4
What am I doing wrong?
You can use map to achieve this:
In [62]:
df1['Type'].map(df2.set_index('Type')['Value'],na_action='ignore')
Out[62]:
0 1
1 1
2 1
3 4
4 2
5 5
6 1
7 4
Name: Type, dtype: int64
If you modified your apply attempt to the following then it would've worked:
In [70]:
df1['Type'].apply(lambda x: df2.loc[df2.Type == x,'Value'].values[0])
Out[70]:
0 1
1 1
2 1
3 4
4 2
5 5
6 1
7 4
Name: Type, dtype: int64
If we look at what you tried:
df1.apply(lambda x: df2.Value[df2.Type == x['Type']],axis=1)
this is trying to compare the 'Type' and return the 'Value'. The problem is that you're returning a Series carrying df2's index; pandas aligns on that index, which is what produces the matrix. You can see this if we hard-code 'B' as an example:
In [75]:
df2.Value[df2.Type == 'B']
Out[75]:
1 1
Name: Value, dtype: int64
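For completeness, a merge-based lookup is a third option; it behaves like map here (a sketch using the frames defined in the question):

```python
import pandas as pd

df1 = pd.DataFrame({'Type': ['A', 'A', 'B', 'F', 'C', 'G', 'A', 'E'],
                    'Other': [999.] * 8})
df2 = pd.DataFrame({'Type': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
                    'Value': [1, 1, 2, 3, 4, 4, 5]})

# a left merge preserves df1's row order and attaches Value per Type
merged = df1.merge(df2, on='Type', how='left')
```

Unlike map, merge keeps the rest of df1's columns alongside the looked-up Value in one step.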