I'm trying to replace values in a Pandas DataFrame, based on certain criteria on multiple columns. For a single-column criterion this can be done very elegantly with a dictionary (e.g. Remap values in pandas column with a dict):
import pandas as pd
df = pd.DataFrame({'col1': {0:1, 1:1, 2:2}, 'col2': {0:10, 1:20, 2:20}})
rdict = {1:'a', 2:'b'}
df2 = df.replace({"col1": rdict})
Input df:
col1 col2
0 1 10
1 1 20
2 2 20
Resulting df2:
col1 col2
0 a 10
1 a 20
2 b 20
I'm trying to extend this to criteria over multiple columns (e.g. where col1==1, col2==10 -> replace). For a single criterion this can be done like:
df3=df.copy()
df3.loc[((df['col1']==1)&(df['col2']==10)), 'col1'] = 'c'
Which results in a df3:
col1 col2
0 c 10
1 1 20
2 2 20
My real life problem has a large number of criteria, which would involve a large number of df3.loc[((criteria1)&(criteria2)), column] = value calls, which is far less elegant than the replacement using a dictionary as a "lookup table". Is it possible to extend the elegant solution (df2 = df.replace({"col1": rdict})) to a setup where values in one column are replaced based on criteria over multiple columns?
An example of what I'm trying to achieve (although in my real life case the number of criteria is a lot larger):
df = pd.DataFrame({'col1': {0:1, 1:1, 2:2, 3:2}, 'col2': {0:10, 1:20, 2:10, 3:20}})
df3=df.copy()
df3.loc[((df['col1']==1)&(df['col2']==10)), 'col1'] = 'a'
df3.loc[((df['col1']==1)&(df['col2']==20)), 'col1'] = 'b'
df3.loc[((df['col1']==2)&(df['col2']==10)), 'col1'] = 'c'
df3.loc[((df['col1']==2)&(df['col2']==20)), 'col1'] = 'd'
Input df:
col1 col2
0 1 10
1 1 20
2 2 10
3 2 20
Resulting df3:
col1 col2
0 a 10
1 b 20
2 c 10
3 d 20
We can use merge.
Suppose your df looks like
df = pd.DataFrame({'col1': {0:1, 1:1, 2:2, 3:2, 4:2, 5:1}, 'col2': {0:10, 1:20, 2:10, 3:20, 4: 20, 5:10}})
col1 col2
0 1 10
1 1 20
2 2 10
3 2 20
4 2 20
5 1 10
And your conditional replacement can be represented as another dataframe:
df_replace
col1 col2 val
0 1 10 a
1 1 20 b
2 2 10 c
3 2 20 d
(As the OP (Bart) pointed out, you can save this in a CSV file.)
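For completeness, a minimal sketch of building df_replace inline, or loading it from a file (the file name lookup.csv is an assumption):
import pandas as pd

# build the lookup table inline...
df_replace = pd.DataFrame({'col1': [1, 1, 2, 2],
                           'col2': [10, 20, 10, 20],
                           'val': ['a', 'b', 'c', 'd']})

# ...or read it from a CSV file with the same three columns
# df_replace = pd.read_csv('lookup.csv')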
Then you can use
df = df.merge(df_replace, on=["col1", "col2"], how="left")
col1 col2 val
0 1 10 a
1 1 20 b
2 2 10 c
3 2 20 d
4 2 20 d
5 1 10 a
Then you just need to drop col1.
As MaxU pointed out, there could be rows that do not get replaced, which results in NaN. We can use a line like
df["val"] = df["val"].combine_first(df["col1"])
to fill in values from col1 wherever the value after the merge is NaN.
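Putting the pieces together, a minimal sketch of the full pipeline (merge, fill the gaps from col1, then tidy up the columns; the final rename is my own choice):
df = df.merge(df_replace, on=['col1', 'col2'], how='left')
# fall back to the original col1 value where no replacement matched
df['val'] = df['val'].combine_first(df['col1'])
# make val the new col1 and restore the original column layout
df = df.drop(columns='col1').rename(columns={'val': 'col1'})[['col1', 'col2']]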
Demo:
Source DF:
In [120]: df
Out[120]:
col1 col2
0 1 10
1 1 10
2 1 20
3 1 20
4 2 10
5 2 20
6 3 30
Conditions & Replacements DF:
In [121]: cond
Out[121]:
col1 col2 repl
1 1 20 b
2 2 10 c
0 1 10 a
3 2 20 d
Solution:
In [121]: res = df.merge(cond, how='left')
yields:
In [122]: res
Out[122]:
col1 col2 repl
0 1 10 a
1 1 10 a
2 1 20 b
3 1 20 b
4 2 10 c
5 2 20 d
6 3 30 NaN # <-- NOTE
In [123]: res['col1'] = res.pop('repl').fillna(res['col1'])
In [124]: res
Out[124]:
col1 col2
0 a 10
1 a 10
2 b 20
3 b 20
4 c 10
5 d 20
6 3 30
This method is likely to be more efficient than pandas functionality, as it relies on numpy arrays and dictionary mappings.
import pandas as pd
df = pd.DataFrame({'col1': {0:1, 1:1, 2:2, 3:2}, 'col2': {0:10, 1:20, 2:10, 3:20}})
rdict = {(1, 10): 'a', (1, 20): 'b', (2, 10): 'c', (2, 20): 'd'}
df['col1'] = list(map(rdict.get, [(x[0], x[1]) for x in df[['col1', 'col2']].values]))
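One caveat, with a hedged sketch: rdict.get returns None for any (col1, col2) pair that is not in the dictionary. If you want to keep the original value in that case, a fallback like this works (the fallback-to-col1 choice is my assumption):
import pandas as pd

df = pd.DataFrame({'col1': {0: 1, 1: 1, 2: 2, 3: 2}, 'col2': {0: 10, 1: 20, 2: 10, 3: 20}})
rdict = {(1, 10): 'a', (1, 20): 'b', (2, 10): 'c', (2, 20): 'd'}

# look up each (col1, col2) pair in rdict; keep the original col1
# value (p[0]) when the pair has no entry
pairs = (tuple(x) for x in df[['col1', 'col2']].values)
df['col1'] = [rdict.get(p, p[0]) for p in pairs]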
Related
Is there any way to remove from the first DataFrame all rows that can be found in the second DataFrame, and add the rows that exist only in the second DataFrame (= XOR)? Here's a twist: the first DataFrame has one column that shall be ignored during the comparison.
import pandas as pd
df1 = pd.DataFrame({'col1': [1,2,3],
                    'col2': [4,5,6],
                    'spec': ['A','B','C']})
df2 = pd.DataFrame({'col1': [1,9],
                    'col2': [4,9]})
result = pd.DataFrame({'col1': [2,3,9],
                       'col2': [5,6,9],
                       'spec': ['B','C','df2']})
df1 = df1.astype(str)
df2 = df2.astype(str)
This is analogous to a UNION (not UNION ALL) operation.
Combine
col1 col2 spec
0 1 4 A
1 2 5 B
2 3 6 C
and
col1 col2
0 1 4
1 9 9
to
col1 col2 spec
1 2 5 B
2 3 6 C
1 9 9 df2
You could concatenate and drop duplicates:
out = (pd.concat((df1, df2.assign(spec='df2')))
         .drop_duplicates(subset=['col1','col2'], keep=False))
or filter out the common rows with a row-wise membership test and concatenate (DataFrame.isin with another DataFrame compares element-wise by aligned index, so a MultiIndex is used here instead):
idx1 = pd.MultiIndex.from_frame(df1[['col1', 'col2']])
idx2 = pd.MultiIndex.from_frame(df2[['col1', 'col2']])
out = pd.concat((df1[~idx1.isin(idx2)],
                 df2[~idx2.isin(idx1)].assign(spec='df2')))
Output:
col1 col2 spec
1 2 5 B
2 3 6 C
1 9 9 df2
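A third route, shown as a hedged sketch, uses merge with indicator=True to label which side each row came from and keeps only the one-sided rows (frames and columns as defined above):
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3],
                    'col2': [4, 5, 6],
                    'spec': ['A', 'B', 'C']})
df2 = pd.DataFrame({'col1': [1, 9],
                    'col2': [4, 9]})

# outer merge on the comparison columns; _merge records which frame(s) each row is in
m = df1.merge(df2, on=['col1', 'col2'], how='outer', indicator=True)
out = m[m['_merge'] != 'both'].drop(columns='_merge')
# rows that came only from df2 have no spec yet; label them as in the expected output
out['spec'] = out['spec'].fillna('df2')
print(out)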
I have the following sample input data:
import pandas as pd
df = pd.DataFrame({'col1': ['x', 'y', 'z'], 'col2': [1, 2, 3], 'col3': ['a', 'a', 'b']})
I would like to sort and group by col3, interleaving each group's summary on top of the corresponding group (with the group label in col1), to get the following output:
col1 col2
0 a 3
1 x 1
2 y 2
3 b 3
4 z 3
I can of course do the part:
df.sort_values(by=['col3']).groupby(by=['col3'])[['col2']].sum()
col2
col3
a 3
b 3
but I am not sure how to interleave the group labels on top of col1.
Use a custom function that prepends a summary row to each group:
def f(x):
    summary = pd.DataFrame({'col1': x.name, 'col2': x['col2'].sum()}, index=[0])
    return pd.concat([summary, x])

df = (df.sort_values(by=['col3'])
        .groupby(by=['col3'], group_keys=False)
        .apply(f)
        .drop(columns='col3')
        .reset_index(drop=True))
print (df)
col1 col2
0 a 3
1 x 1
2 y 2
3 b 3
4 z 3
A more performant solution is to use GroupBy.ngroup for the indices, aggregate the sum, and finally join the values with concat, sorting with the stable mergesort:
df = df.sort_values(by=['col3'])
df1 = df.groupby(by=['col3'])['col2'].sum().rename_axis('col1').reset_index()
df2 = df.set_index(df.groupby(by=['col3']).ngroup())
df = pd.concat([df1, df2]).sort_index(kind='mergesort', ignore_index=True).drop(columns='col3')
print (df)
col1 col2
0 a 3
1 x 1
2 y 2
3 b 3
4 z 3
What about:
(df.melt(id_vars='col2')
 .rename(columns={'value': 'col1'})
 .groupby('col1')[['col2']].sum()
 .reset_index()
)
output:
col1 col2
0 a 3
1 b 3
2 x 1
3 y 2
4 z 3
def function1(dd: pd.DataFrame):
    # insert the group's summary row just before its first row by writing to a
    # fractional index label (dd.index.min() - 0.5); this mutates the outer df
    df.loc[dd.index.min() - 0.5, ['col1', 'col2']] = [dd.name, dd.col2.sum()]

df.groupby('col3').apply(function1).pipe(lambda dd: df.sort_index(ignore_index=True)).drop('col3', axis=1)
output:
col1 col2
0 a 3
1 x 1
2 y 2
3 b 3
4 z 3
or use the pandasql library (note: this code assumes a DataFrame .sql helper that exposes the calling frame as a table named self; the standard pandasql package instead provides a sqldf function, sketched after the output below):
def function1(dd: pd.DataFrame):
    # build the group's summary row in SQL, then union the group's own rows
    return dd.sql("select '{}' as col1,{} as col2 union select col1,col2 from self".format(dd.name, dd.col2.sum()))

df.groupby('col3').apply(function1).reset_index(drop=True)
col1 col2
0 a 3
1 x 1
2 y 2
3 b 3
4 z 3
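For reference, a minimal sketch of the same idea with the standard pandasql API (sqldf is the package's documented entry point; the helper name summarize is my own):
import pandas as pd
from pandasql import sqldf

df = pd.DataFrame({'col1': ['x', 'y', 'z'], 'col2': [1, 2, 3], 'col3': ['a', 'a', 'b']})

def summarize(dd):
    # summary row first, then the group's own rows; sqldf sees dd via locals()
    query = ("select '{}' as col1, {} as col2 "
             "union all select col1, col2 from dd").format(dd.name, dd.col2.sum())
    return sqldf(query, locals())

print(df.groupby('col3').apply(summarize).reset_index(drop=True))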
It's similar to this question, but with an additional level of complexity.
In my case, I have the following dataframe:
import pandas as pd
df = pd.DataFrame({'col1': list('aaabbbabababbaaa'), 'col2': list('cdddccdsssssddcd'), 'val': range(0, 16)})
output:
col1 col2 val
0 a c 0
1 a d 1
2 a d 2
3 b d 3
4 b c 4
5 b c 5
6 a d 6
7 b s 7
8 a s 8
9 b s 9
10 a s 10
11 b s 11
12 b d 12
13 a d 13
14 a c 14
15 a d 15
My goal is to select random groups of groupby(['col1', 'col2']) such that each value of col1 will be selected only once.
This can be executed by the following code:
import numpy as np

g = df.groupby('col1')
indexes = []
for _, group in g:
    g_ = group.groupby('col2')
    a = np.arange(g_.ngroups)
    np.random.shuffle(a)
    indexes.extend(group[g_.ngroup().isin(a[:1])].index.tolist())
output:
print(df[df.index.isin(indexes)])
col1 col2 val
4 b c 4
5 b c 5
8 a s 8
10 a s 10
However, I'm looking for a more concise and pythonic way to solve this.
Another option is to shuffle your two columns with sample and drop_duplicates by col1, so that you keep only one couple per col1 value. Then merge the result with df to select all the rows with these couples.
print(df.merge(df[['col1','col2']].sample(frac=1).drop_duplicates('col1')))
col1 col2 val
0 b s 7
1 b s 9
2 b s 11
3 a s 8
4 a s 10
or with groupby and sample, much the same idea, but selecting only one row per col1 value and merging afterwards:
df.merge(df[['col1','col2']].groupby('col1').sample(n=1))
EDIT: to get both the selected rows and the other rows, you can use the indicator parameter in merge and do a left merge, then query each separately:
m = df.merge(df[['col1','col2']].groupby('col1').sample(1), how='left', indicator=True)
print(m)
select_ = m.query('_merge=="both"')[df.columns]
print(select_)
comp_ = m.query('_merge=="left_only"')[df.columns]
print(comp_)
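A small usage note as a sketch: DataFrameGroupBy.sample accepts random_state, so the draw can be made reproducible (the seed value 42 is arbitrary):
# reproducible draw of one (col1, col2) pair per col1 value
picked = df[['col1', 'col2']].groupby('col1').sample(n=1, random_state=42)
m = df.merge(picked, how='left', indicator=True)
select_ = m.query('_merge == "both"')[df.columns]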
How do I create a flag to include/exclude rows from a df in Python? Example: flag = 1 will include all rows from the df, and flag = 0 will include all rows except those where col2 = 'a'.
row col1 col2
1 1 a
2 2 a
3 3 b
4 4 c
Expected output: flag = 1 should populate all rows; flag = 0 should only populate rows 3 and 4.
You can create a set of rules with strings in a dictionary and apply them using the df.query method.
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({
    "col1": np.random.randint(4, size=10),
    "col2": np.random.choice(list("abcd"), 10)})
print(df)
col1 col2
0 0 b
1 3 c
2 1 a
3 0 d
4 3 c
5 3 a
6 3 a
7 3 a
8 1 c
9 3 b
Create a dictionary of flag -> rules as well as a function to apply them:
def flag_filter(df, flag, rule_dict):
    rule = rule_dict[flag]
    if rule is None:
        return df
    return df.query(rule)
rule_dict = {
    1: None,                             # flag 1 returns the dataframe unfiltered
    0: "col2 != 'a'",                    # flag 0 = where col2 is not equal to "a"
    2: "col1 == 3 & col2 == ['b', 'c']"  # flag 2 = where col1 equals 3 AND col2 equals either "b" or "c"
}
Flag 1
subset = flag_filter(df, 1, rule_dict)
print(subset)
col1 col2
0 0 b
1 3 c
2 1 a
3 0 d
4 3 c
5 3 a
6 3 a
7 3 a
8 1 c
9 3 b
Flag 0
subset = flag_filter(df, 0, rule_dict)
print(subset)
col1 col2
0 0 b
1 3 c
3 0 d
4 3 c
8 1 c
9 3 b
Flag 2
subset = flag_filter(df, 2, rule_dict)
print(subset)
col1 col2
1 3 c
4 3 c
9 3 b
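As a variant, sketched under the assumption that you control the rule table, the rules can also be stored as callables instead of query strings, which sidesteps df.query's parsing rules (the names rule_dict_callable and flag_filter_callable are my own):
# rules as callables rather than query strings
rule_dict_callable = {
    1: None,                            # flag 1: unfiltered
    0: lambda d: d[d['col2'] != 'a'],   # flag 0: drop rows where col2 == 'a'
    2: lambda d: d[(d['col1'] == 3) & (d['col2'].isin(['b', 'c']))],
}

def flag_filter_callable(df, flag, rules):
    rule = rules.get(flag)
    return df if rule is None else rule(df)

subset = flag_filter_callable(df, 0, rule_dict_callable)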
We have two data sets with one variable, col1.
Some levels are missing in the second data set. For example, let
import pandas as pd
df1 = pd.DataFrame({'col1':["A","A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
When we factorize df1
df1["f_col1"]= pd.factorize(df1.col1)[0]
df1
we get
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
But when we do it for df2
df2["f_col1"]= pd.factorize(df2.col1)[0]
df2
we get
col1 f_col1
0 A 0
1 B 1
2 D 2
3 E 3
This is not what I want. I want to keep the same factorization between the data sets, i.e. in df2 we should have something like
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
Thanks.
PS: The two data sets are not always available at the same time, so I cannot concat them. The values should be stored from df1 and used in df2 when it becomes available.
You could concatenate the two DataFrames, then apply pd.factorize once to the entire column:
import pandas as pd
df1 = pd.DataFrame({'col1':["A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
df = pd.concat({'df1':df1, 'df2':df2})
df['f_col1'], uniques = pd.factorize(df['col1'])
print(df)
yields
col1 f_col1
df1 0 A 0
1 B 1
2 C 2
3 D 3
4 E 4
df2 0 A 0
1 B 1
2 D 3
3 E 4
To extract df1 and df2 from df you could use df.loc:
In [116]: df.loc['df1']
Out[116]:
col1 f_col1
0 A 0
1 B 1
2 C 2
3 D 3
4 E 4
In [117]: df.loc['df2']
Out[117]:
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
(But note that since the performance of vectorized operations improves if you can apply them once to a large DataFrame instead of multiple times to smaller DataFrames, you might be better off keeping df and ditching df1 and df2...)
Alternatively, if you must generate df1['f_col1'] first, and then compute
df2['f_col1'] later, you could use merge to join df1 and df2 on col1:
import pandas as pd
df1 = pd.DataFrame({'col1':["A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
df1['f_col1'], uniques = pd.factorize(df1['col1'])
df2 = pd.merge(df2, df1, how='left')
print(df2)
yields
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
You could reuse the f_col1 column of df1 and map the values of df2.col1 by setting the index on df1.col1:
In [265]: df2.col1.map(df1.set_index('col1').f_col1)
Out[265]:
0 0
1 1
2 3
3 4
Details
In [266]: df2['f_col1'] = df2.col1.map(df1.set_index('col1').f_col1)
In [267]: df2
Out[267]:
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
In case df1 has duplicate records, drop them first using drop_duplicates:
In [290]: df1
Out[290]:
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
In [291]: df2.col1.map(df1.drop_duplicates().set_index('col1').f_col1)
Out[291]:
0 0
1 1
2 3
3 4
Name: col1, dtype: int32
Get the unique values across both sets of data, then create a Series or a dictionary. This is your factorization, which can be used across both data sets. Use map to get the output you are looking for.
import numpy as np

u = np.unique(np.append(df1.col1.values, df2.col1.values))
f = pd.Series(range(len(u)), u) # this is factorization
Assign with map
df1['f_col1'] = df1.col1.map(f)
df2['f_col1'] = df2.col1.map(f)
print(df1)
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
print(df2)
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
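Since the OP's PS says the two data sets are not available at the same time, here is a minimal sketch of persisting the levels learned from df1 and reusing them later (the file name factor_levels.json is an assumption):
import json
import pandas as pd

df1 = pd.DataFrame({'col1': ['A', 'A', 'B', 'C', 'D', 'E']})

# learn the levels from df1 and store them
df1['f_col1'], uniques = pd.factorize(df1['col1'])
with open('factor_levels.json', 'w') as fh:
    json.dump(list(uniques), fh)

# later, when df2 arrives, reuse the stored levels
df2 = pd.DataFrame({'col1': ['A', 'B', 'D', 'E']})
with open('factor_levels.json') as fh:
    levels = json.load(fh)
# codes follow the stored order; values unseen in df1 get -1
df2['f_col1'] = pd.Categorical(df2['col1'], categories=levels).codes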