Merge Pandas Dataframe based on boolean function - python

I am looking for an efficient way to merge two pandas data frames based on a function that takes as input columns from both data frames and returns True or False. For example, assume I have the following "tables":
import pandas as pd
df_1 = pd.DataFrame(data=[1, 2, 3])
df_2 = pd.DataFrame(data=[4, 5, 6])
def validation(a, b):
    return ((a + b) % 2) == 0
I would like to join df_1 and df_2 on each pair of rows where the sum of their first columns is an even number. The resulting table would be
1 5
df_3 = 2 4
2 6
3 5
Please think of it as a general problem not as a task to return just df_3. The solution should accept any function that validates a combination of columns and return True or False.
THX Lazloo

You can do this with a merge on parity:
(df_1.assign(parity=df_1[0] % 2)
     .merge(df_2.assign(parity=df_2[0] % 2), on='parity')
     .drop('parity', axis=1)
)
output:
0_x 0_y
0 1 5
1 3 5
2 2 4
3 2 6

You can use broadcasting, or the outer functions, to compare all rows. You'll run into issues as the length becomes large.
import pandas as pd
import numpy as np
def validation(a, b):
    """a, b : np.array"""
    arr = np.add.outer(a, b)       # How to combine rows
    i, j = np.where(arr % 2 == 0)  # Condition
    return pd.DataFrame(np.stack([a[i], b[j]], axis=1))
validation(df_1[0].to_numpy(), df_2[0].to_numpy())
0 1
0 1 5
1 2 4
2 2 6
3 3 5
In this particular case you can leverage the fact that the sum of two numbers is even exactly when they have the same parity, so define a parity column and merge on that.
df_1['parity'] = df_1[0]%2
df_2['parity'] = df_2[0]%2
df_3 = df_1.merge(df_2, on='parity')
0_x parity 0_y
0 1 1 5
1 3 1 5
2 2 0 4
3 2 0 6
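If the validation function cannot be reduced to a merge key, a cross join followed by a row-wise filter is a fully general (if memory-hungry) fallback. A minimal sketch, assuming pandas 1.2+ for how='cross'; the column names 'a' and 'b' are only introduced here to keep the joined frame unambiguous:
import pandas as pd

df_1 = pd.DataFrame(data=[1, 2, 3])
df_2 = pd.DataFrame(data=[4, 5, 6])

def validation(a, b):
    return ((a + b) % 2) == 0

# give the single columns distinct names, build every combination,
# then keep only the pairs the validation function accepts
left = df_1.rename(columns={0: 'a'})
right = df_2.rename(columns={0: 'b'})
pairs = left.merge(right, how='cross')
df_3 = pairs[validation(pairs['a'], pairs['b'])].reset_index(drop=True)
Note that this materialises all len(df_1) * len(df_2) combinations, so it only scales to moderate sizes.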

This is a basic solution, but not very efficient if you are working with large dataframes:
df_1.index *= 0
df_2.index *= 0
df = df_1.join(df_2, lsuffix='_2')
df = df[df.sum(axis=1) % 2 == 0]
Edit: here is a better solution
df_1.index = df_1.iloc[:,0] % 2
df_2.index = df_2.iloc[:,0] % 2
df = df_1.join(df_2, lsuffix='_2')

Related

How can pandas set up filtering with an uncertain number of conditions and forms? [duplicate]

Most operations in pandas can be accomplished with operator chaining (groupby, aggregate, apply, etc), but the only way I've found to filter rows is via normal bracket indexing
df_filtered = df[df['column'] == value]
This is unappealing as it requires I assign df to a variable before being able to filter on its values. Is there something more like the following?
df_filtered = df.mask(lambda x: x['column'] == value)
I'm not entirely sure what you want, and your last line of code does not help either, but anyway:
"Chained" filtering is done by "chaining" the criteria in the boolean index.
In [96]: df
Out[96]:
A B C D
a 1 4 9 1
b 4 5 0 2
c 5 5 1 0
d 1 3 9 6
In [99]: df[(df.A == 1) & (df.D == 6)]
Out[99]:
A B C D
d 1 3 9 6
If you want to chain methods, you can add your own mask method and use that one.
In [90]: def mask(df, key, value):
   ....:     return df[df[key] == value]
   ....:
In [92]: pandas.DataFrame.mask = mask
In [93]: df = pandas.DataFrame(np.random.randint(0, 10, (4,4)), index=list('abcd'), columns=list('ABCD'))
In [95]: df.loc['d', 'A'] = df.loc['a', 'A']
In [96]: df
Out[96]:
A B C D
a 1 4 9 1
b 4 5 0 2
c 5 5 1 0
d 1 3 9 6
In [97]: df.mask('A', 1)
Out[97]:
A B C D
a 1 4 9 1
d 1 3 9 6
In [98]: df.mask('A', 1).mask('D', 6)
Out[98]:
A B C D
d 1 3 9 6
Filters can be chained using a Pandas query:
df = pd.DataFrame(np.random.randn(30, 3), columns=['a','b','c'])
df_filtered = df.query('a > 0').query('0 < b < 2')
Filters can also be combined in a single query:
df_filtered = df.query('a > 0 and 0 < b < 2')
The answer from #lodagro is great. I would extend it by generalizing the mask function as:
def mask(df, f):
    return df[f(df)]
Then you can do stuff like:
df.mask(lambda x: x[0] < 0).mask(lambda x: x[1] > 0)
Since version 0.18.1 the .loc method accepts a callable for selection. Together with lambda functions you can create very flexible chainable filters:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
df.loc[lambda df: df.A == 80] # equivalent to df[df.A == 80] but chainable
df.sort_values('A').loc[lambda df: df.A > 80].loc[lambda df: df.B > df.A]
If all you're doing is filtering, you can also omit the .loc.
pandas provides two alternatives to Wouter Overmeire's answer which do not require any overriding. One is .loc[.] with a callable, as in
df_filtered = df.loc[lambda x: x['column'] == value]
the other is .pipe(), as in
df_filtered = df.pipe(lambda x: x.loc[x['column'] == value])
I offer this for additional examples. This is the same answer as https://stackoverflow.com/a/28159296/
I'll add other edits to make this post more useful.
pandas.DataFrame.query
query was made for exactly this purpose. Consider the dataframe df
import pandas as pd
import numpy as np
np.random.seed([3,1415])
df = pd.DataFrame(
    np.random.randint(10, size=(10, 5)),
    columns=list('ABCDE')
)
df
A B C D E
0 0 2 7 3 8
1 7 0 6 8 6
2 0 2 0 4 9
3 7 3 2 4 3
4 3 6 7 7 4
5 5 3 7 5 9
6 8 7 6 4 7
7 6 2 6 6 5
8 2 8 7 5 8
9 4 7 6 1 5
Let's use query to filter all rows where D > B
df.query('D > B')
A B C D E
0 0 2 7 3 8
1 7 0 6 8 6
2 0 2 0 4 9
3 7 3 2 4 3
4 3 6 7 7 4
5 5 3 7 5 9
7 6 2 6 6 5
Which we chain
df.query('D > B').query('C > B')
# equivalent to
# df.query('D > B and C > B')
# but defeats the purpose of demonstrating chaining
A B C D E
0 0 2 7 3 8
1 7 0 6 8 6
4 3 6 7 7 4
5 5 3 7 5 9
7 6 2 6 6 5
My answer is similar to the others. If you do not want to create a new function you can use what pandas has defined for you already. Use the pipe method.
df.pipe(lambda d: d[d['column'] == value])
I had the same question except that I wanted to combine the criteria into an OR condition. The format given by Wouter Overmeire combines the criteria into an AND condition such that both must be satisfied:
In [96]: df
Out[96]:
A B C D
a 1 4 9 1
b 4 5 0 2
c 5 5 1 0
d 1 3 9 6
In [99]: df[(df.A == 1) & (df.D == 6)]
Out[99]:
A B C D
d 1 3 9 6
But if you join the criteria with the pipe operator |, they are combined as an OR condition, satisfied whenever either of them is true (the extra (... == True) wrapping is redundant):
df[(df.A == 1) | (df.D == 6)]
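A small self-contained sketch of the same OR filter using query (which accepts or as well as |), built on the frame shown above:
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 5, 1], 'B': [4, 5, 5, 3],
                   'C': [9, 0, 1, 9], 'D': [1, 2, 0, 6]},
                  index=list('abcd'))
print(df.query('A == 1 or D == 6'))  # keeps rows a and d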
Just to add a demonstration using loc to filter not only by rows but also by columns, and to note some merits of the chained operation.
The code below can filter the rows by value.
df_filtered = df.loc[df['column'] == value]
By modifying it a bit you can filter the columns as well.
df_filtered = df.loc[df['column'] == value, ['year', 'column']]
So why do we want a chained method? The answer is that it is simple to read if you have many operations. For example,
res = df\
    .loc[df['station'] == 'USA', ['TEMP', 'RF']]\
    .groupby('year')\
    .agg(np.nanmean)
If you would like to apply all of the common boolean masks as well as a general purpose mask you can chuck the following in a file and then simply assign them all as follows:
pd.DataFrame = apply_masks()
Usage:
A = pd.DataFrame(np.random.randn(4, 4), columns=["A", "B", "C", "D"])
A.le_mask("A", 0.7).ge_mask("B", 0.2)... (May be repeated as necessary
It's a little bit hacky but it can make things a little bit cleaner if you're continuously chopping and changing datasets according to filters.
There's also a general purpose filter adapted from Daniel Velkov above in the gen_mask function which you can use with lambda functions or otherwise if desired.
File to be saved (I use masks.py):
import pandas as pd

def eq_mask(df, key, value):
    return df[df[key] == value]

def ge_mask(df, key, value):
    return df[df[key] >= value]

def gt_mask(df, key, value):
    return df[df[key] > value]

def le_mask(df, key, value):
    return df[df[key] <= value]

def lt_mask(df, key, value):
    return df[df[key] < value]

def ne_mask(df, key, value):
    return df[df[key] != value]

def gen_mask(df, f):
    return df[f(df)]

def apply_masks():
    pd.DataFrame.eq_mask = eq_mask
    pd.DataFrame.ge_mask = ge_mask
    pd.DataFrame.gt_mask = gt_mask
    pd.DataFrame.le_mask = le_mask
    pd.DataFrame.lt_mask = lt_mask
    pd.DataFrame.ne_mask = ne_mask
    pd.DataFrame.gen_mask = gen_mask
    return pd.DataFrame

if __name__ == '__main__':
    pass
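For completeness, a minimal usage sketch, assuming the file above is saved as masks.py next to your script:
import numpy as np
import pandas as pd
from masks import apply_masks  # the helpers defined above

apply_masks()  # attaches the *_mask methods to pd.DataFrame

A = pd.DataFrame(np.random.randn(4, 4), columns=["A", "B", "C", "D"])
filtered = A.le_mask("A", 0.7).ge_mask("B", 0.2).gen_mask(lambda d: d["C"] > d["D"])
print(filtered)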
This solution is more hackish in terms of implementation, but I find it much cleaner in terms of usage, and it is certainly more general than the others proposed.
https://github.com/toobaz/generic_utils/blob/master/generic_utils/pandas/where.py
You don't need to download the entire repo: saving the file and doing
from where import where as W
should suffice. Then you use it like this:
df = pd.DataFrame([[1, 2, True],
                   [3, 4, False],
                   [5, 7, True]],
                  index=range(3), columns=['a', 'b', 'c'])
# On specific column:
print(df.loc[W['a'] > 2])
print(df.loc[-W['a'] == W['b']])
print(df.loc[~W['c']])
# On entire - or subset of a - DataFrame:
print(df.loc[W.sum(axis=1) > 3])
print(df.loc[W[['a', 'b']].diff(axis=1)['b'] > 1])
A slightly less stupid usage example:
data = pd.read_csv('ugly_db.csv').loc[~(W == '$null$').any(axis=1)]
By the way: even in the case in which you are just using boolean cols,
df.loc[W['cond1']].loc[W['cond2']]
can be much more efficient than
df.loc[W['cond1'] & W['cond2']]
because it evaluates cond2 only where cond1 is True.
DISCLAIMER: I first gave this answer elsewhere because I hadn't seen this.
This is unappealing as it requires I assign df to a variable before being able to filter on its values.
df[df["column_name"] != 5].groupby("other_column_name")
seems to work: you can nest the [] operator as well. Maybe they added it since you asked the question.
So the way I see it is that you do two things when sub-setting your data ready for analysis.
get rows
get columns
Pandas has a number of ways of doing each of these and some techniques that help get rows and columns. For new Pandas users it can be confusing as there is so much choice.
Do you use iloc, loc, brackets, query, isin, np.where, mask etc...
Method chaining
Now method chaining is a great way to work when data wrangling. In R there is a simple way of doing it: you select() columns and you filter() rows.
So if we want to keep things simple in pandas, why not use filter() for columns and query() for rows? Both return dataframes, so there is no need to mess around with boolean indexing or to wrap the return value in df[ ].
So what does that look like?
df.filter(['col1', 'col2', 'col3']).query("col1 == 'sometext'")
You can then chain on any other methods like groupby, dropna(), sort_values(), reset_index() etc etc.
By consistently using filter() to get your columns and query() to get your rows, your code will be easier to read when you come back to it later.
But filter can select rows?
Yes, this is true, but by default query() gets rows and filter() gets columns, so if you stick with the defaults there is no need to use the axis= parameter.
query()
query() can be used with both and/or and &/|. You can also use the comparison operators >, <, >=, <=, ==, !=, as well as Python's in and not in.
You can reference a Python variable (such as a list) in query by prefixing it with @, e.g. @my_list.
Some examples of using query to get rows:
df.query('A > B')
df.query('a not in b')
df.query("series == '2206'")
df.query("col1 == #mylist")
df.query('Salary_in_1000 >= 100 & Age < 60 & FT_Team.str.startswith("S").values')
filter()
So filter is basically like using bracket notation df[] or df[[]] in that it uses labels to select columns, but it does more than the bracket notation.
filter has a like= parameter to help select columns by partial name.
df.filter(like='partial_name')
filter also has a regex= parameter to help with selection.
df.filter(regex='reg_string')
To sum up, this way of working might not fit every situation, e.g. if you want to use indexing/slicing then iloc is the way to go, but it does seem to be a solid way of working that can simplify your workflow and code.
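As a small illustration of the chaining described above (the data and column names here are made up for the example):
import pandas as pd

df = pd.DataFrame({
    'col1': ['sometext', 'other', 'sometext', 'sometext'],
    'col2': ['x', 'y', 'x', 'y'],
    'col3': [1, 2, 3, 4],
    'unused': [0, 0, 0, 0],
})

res = (df.filter(['col1', 'col2', 'col3'])   # columns
         .query("col1 == 'sometext'")        # rows
         .groupby('col2')['col3']
         .sum()
         .reset_index())
print(res)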
You can also leverage the numpy library for logical operations. It's pretty fast.
df[np.logical_and(df['A'] == 1 ,df['B'] == 6)]
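If the number of conditions is not known in advance (as the question title suggests), np.logical_and.reduce can combine an arbitrary list of masks. A minimal sketch with made-up data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': [6, 5, 6], 'C': [0, 1, 0]})

# build however many boolean conditions you need, then fold them together
conditions = [df['A'] == 1, df['B'] == 6, df['C'] == 0]
print(df[np.logical_and.reduce(conditions)])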
If you set your columns to search as indexes, then you can use DataFrame.xs() to take a cross section. This is not as versatile as the query answers, but it might be useful in some situations.
import pandas as pd
import numpy as np
np.random.seed([3,1415])
df = pd.DataFrame(
    np.random.randint(3, size=(10, 5)),
    columns=list('ABCDE')
)
df
# Out[55]:
# A B C D E
# 0 0 2 2 2 2
# 1 1 1 2 0 2
# 2 0 2 0 0 2
# 3 0 2 2 0 1
# 4 0 1 1 2 0
# 5 0 0 0 1 2
# 6 1 0 1 1 1
# 7 0 0 2 0 2
# 8 2 2 2 2 2
# 9 1 2 0 2 1
df.set_index(['A', 'D']).xs([0, 2]).reset_index()
# Out[57]:
# A D B C E
# 0 0 2 2 2 2
# 1 0 2 1 1 0

How can I swap half of two columns in a pandas dataframe in Python?

I am trying to create a machine learning model and teaching myself as I go. I will be working with a large dataset, but before I get to that, I am practicing with a smaller dataset to make sure everything is working as expected. I will need to swap half of the rows of two columns in my dataset, and I am not sure how to accomplish this.
Say I have a dataframe like the below:
index  number  letter
0      1       A
1      2       B
2      3       C
3      4       D
4      5       E
5      6       F
I want to randomly swap half of the rows of the number and letter columns, so one output could look like this:
index  number  letter
0      1       A
1      B       2
2      3       C
3      D       4
4      5       E
5      F       6
Is there a way to do this in python?
edit: thank you for all of your answers, I greatly appreciate it! :)
Here's one way to implement this.
import pandas as pd
from random import sample
df = pd.DataFrame({'index':range(6),'number':range(1,7),'letter':[*'ABCDEF']}).set_index('index')
n = len(df)
idx = sample(range(n),k=n//2) # randomly select which rows to switch
df.iloc[idx, :] = df.iloc[idx, ::-1].values  # switch those rows
An example result is
number letter
index
0 1 A
1 2 B
2 C 3
3 4 D
4 E 5
5 F 6
Update
To randomly select rows, use np.random.choice:
import numpy as np
idx = np.random.choice(df.index, len(df) // 2, replace=False)
df.loc[idx, ['letter', 'number']] = df.loc[idx, ['number', 'letter']].to_numpy()
print(df)
# Output
number letter
0 1 A
1 2 B
2 3 C
3 D 4
4 E 5
5 F 6
Old answer
You can try:
df.loc[df.index % 2 == 1, ['letter', 'number']] = \
    df.loc[df.index % 2 == 1, ['number', 'letter']].to_numpy()
print(df)
# Output
number letter
0 1 A
1 B 2
2 3 C
3 D 4
4 5 E
5 F 6
For more readability, use an intermediate variable as a boolean mask:
mask = df.index % 2 == 1
df.loc[mask, ['letter', 'number']] = df.loc[mask, ['number', 'letter']].to_numpy()
You can create a copy of your original data, sample it, and then update it in place, converting to a NumPy ndarray to prevent index alignment from occurring.
swapped_df = df.copy()
sample = swapped_df.sample(frac=0.5, random_state=0)
swapped_df.loc[sample.index, ['number', 'letter']] = sample[['letter', 'number']].to_numpy()
print(swapped_df)
number letter
index
0 1 A
1 B 2
2 C 3
3 4 D
4 E 5
5 6 F
Similar to previous answers but slightly more readable (in my opinion) if you are trying to build your sense for basic pandas operations:
rows_to_change = df.sample(frac=0.5)
rows_to_change = rows_to_change.rename(columns={'number':'letter', 'letter':'number'})
df.loc[rows_to_change.index] = rows_to_change

Looping with an if statement over a dataframe

I'm running into an issue when iterating over rows in a pandas data frame
this is the code I am trying to run
import pandas as pd

data = {'test':  [1, 1, 0, 0, 3, 1, 0, 3, 0],
        'test2': [0, 2, 0, 1, 1, 2, 7, 3, 2],
        }
df = pd.DataFrame(data)
df['combined'] = df['test'] + df['test2']
df['combined'].astype('float64')
df
for index, row in df.iterrows():
    if row['test'] >= 1 & row['test2'] >= 1:
        row['combined'] /= 2
    else:
        pass
So it should divide combined by 2 if both test and test2 have a value of 1 or more; however, it doesn't divide all the rows that should be divided.
Am I making a mistake somewhere?
This is the outcome when I run the code (the columns are test, test2 and combined):
0 1 0 1
1 1 2 3
2 0 0 0
3 0 1 1
4 3 1 2
5 1 2 3
6 0 7 7
7 3 3 3
8 0 2 2
You are using &, the bitwise AND operator, which binds more tightly than >=, so your condition is parsed as row['test'] >= (1 & row['test2']) >= 1. That is what gives an answer you don't expect. Since the row values here are scalars, use the boolean and operator instead, or put parentheses around each comparison.
What you are doing is in general bad practice, as iterating over rows should be avoided for performance reasons unless it is strictly necessary. The solution is to define a mask with your conditions and operate within the mask using .loc:
data = {'test':  [1, 1, 0, 0, 3, 1, 0, 3, 0],
        'test2': [0, 2, 0, 1, 1, 2, 7, 3, 2],
        }
df = pd.DataFrame(data)
df['combined'] = df['test'] + df['test2']
df['combined'] = df['combined'].astype('float64')
mask = (df['test'] >= 1) & (df['test2'] >= 1)
df.loc[mask, 'combined'] /= 2
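If you prefer to build the whole column in one expression, an equivalent alternative (not part of the original answer) is np.where; a minimal self-contained sketch:
import numpy as np
import pandas as pd

data = {'test':  [1, 1, 0, 0, 3, 1, 0, 3, 0],
        'test2': [0, 2, 0, 1, 1, 2, 7, 3, 2]}
df = pd.DataFrame(data)
df['combined'] = (df['test'] + df['test2']).astype('float64')

# halve combined only where both conditions hold, keep it unchanged otherwise
df['combined'] = np.where((df['test'] >= 1) & (df['test2'] >= 1),
                          df['combined'] / 2,
                          df['combined'])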

sum of multiplication of cells in the same row but different column for pandas data frame

I have a data frame
df = pd.DataFrame({'A':[1,2,3],'B':[2,3,4]})
My data looks like this
Index A B
0 1 2
1 2 3
2 3 4
I would like to calculate the sum of multiplication between A and B in each row.
The expected result should be (1x2)+(2x3)+(3x4) = 2 + 6 + 12 = 20.
May I know the pythonic way to do this instead of looping?
You can multiply columns A and B and then use sum:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[1,2,3],'B':[2,3,4]})
print(df)
A B
0 1 2
1 2 3
2 3 4
print(df['A'] * df['B'])
0 2
1 6
2 12
dtype: int64
print((df['A'] * df['B']).sum())
20
Or use prod to multiply all the columns:
print(df.prod(axis=1))
0 2
1 6
2 12
dtype: int64
print(df.prod(axis=1).sum())
20
Thanks to ajcr for the comment:
If you have just two columns, you can also use df.A.dot(df.B) for extra speed, but for three or more columns this is the way to do it!
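A minimal sketch of that dot-product form for the example data:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 3, 4]})
print(df.A.dot(df.B))  # 1*2 + 2*3 + 3*4 = 20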

