How to filter a pandas DataFrame by multiple columns? - python

I'm trying to subset a pandas DataFrame in Python based on two logical conditions, i.e.
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df[df.col1 = 1 and df.col2 = 3]
but I'm getting invalid syntax on line 3.
Is there a way to do this in one line?

You need to use comparison and logical operators: == compares and returns a boolean, while = assigns a value. You also need & rather than and, with parentheses around each condition, because pandas combines boolean Series element-wise.
Try:
df[(df.col1 == 1) & (df.col2 == 3)]
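With the question's sample data this should return the single matching row:
>>> import pandas as pd
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df[(df.col1 == 1) & (df.col2 == 3)]
   col1  col2
0     1     3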

Disclaimer: as mentioned by jp_data_analysis and the pandas docs, the following solution is not the best one, since it uses chained indexing. Please refer to Matt W.'s and AChampion's solutions.
An alternative one line solution.
>>> d = {'col1': [1, 2, 1], 'col2': [3, 4, 2]}
>>> df = pd.DataFrame(data=d)
>>> df[df.col1==1][df.col2==3]
   col1  col2
0     1     3
I have added a third row, with 'col1'=1 and 'col2'=2, so we can have an extra negative test case.

Related

How to perform operations over arrays in a pandas dataframe efficiently?

I've got a pandas DataFrame that contains NumPy arrays in some columns:
import numpy as np, pandas as pd
data = {'col1': [np.array([1, 2]), np.array([3, 4])],
        'col2': [np.array([5, 6]), np.array([7, 8])],
        'col3': [9, 10]}
df = pd.DataFrame(data)
I need to store a large frame like this one in a CSV file, but the arrays have to be strings that look like this:
col1,col2,col3
"(1, 2)","(5, 6)",9
"(3, 4)","(7, 8)",10
What I'm currently doing to achieve this result is to iterate over each column and each row of the DataFrame, but my solution doesn't seem efficient.
This is my current solution:
pd.options.mode.chained_assignment = None
array_columns = [column for column in df.columns if isinstance(df[column].iloc[0], np.ndarray)]
for index, row in df.iterrows():
    for column in array_columns:
        # here, tuple() is only used to replace square brackets with parentheses
        df[column][index] = str(tuple(row[column]))
I tried using apply, although I've heard it's usually not an efficient alternative:
def array_to_str(array):
    return str(tuple(array))
df[array_columns] = df[array_columns].apply(array_to_str)
But my arrays become NaN:
   col1  col2  col3
0   NaN   NaN     9
1   NaN   NaN    10
I tried other similar solutions, but the error:
ValueError: Must have equal len keys and value when setting with an iterable
appeared quite often.
Is there a more efficient way of performing the same operation? My real dataframes can contain many columns and thousands of rows.
Try this:
tupcols = ['col1', 'col2']
df[tupcols] = df[tupcols].apply(lambda col: col.apply(tuple)).astype('str')
df.to_csv()
You need to convert the arrays into tuples to get the correct representation. To do so, you can apply the tuple function to the columns with object dtype.
to_save = df.apply(lambda x: x.map(lambda y: tuple(y)) if x.dtype=='object' else x)
to_save.to_csv(index=False)
Output:
col1,col2,col3
"(1, 2)","(5, 6)",9
"(3, 4)","(7, 8)",10
Note: this would be dangerous if you have other object-dtype columns, e.g. strings.
data = {'col1': [np.array([1, 2]), np.array([3, 4])],
        'col2': [np.array([5, 6]), np.array([7, 8])],
        'col3': [9, 10]}
df = pd.DataFrame(data)
df[['col1', 'col2']] = df[['col1', 'col2']].applymap(lambda x: tuple(x))
df[['col1', 'col2']] = df[['col1', 'col2']].applymap(lambda x: ''' "{}" '''.format(x))
col1 col2 col3
0 "(1, 2)" "(5, 6)" 9
1 "(3, 4)" "(7, 8)" 10

Compare multiple columns of two data frames using pandas

I have two data frames; df1 has Id and sendDate and df2 has Id and actDate. The two df's are not the same shape - df2 is a lookup table. There may be multiple instances of Id.
ex.
df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 2],
                    "sendDate": ["2019-09-24", "2020-09-11", "2018-01-06", "2018-01-06", "2019-09-24"]})
df2 = pd.DataFrame({"Id": [1, 2, 2],
                    "actDate": ["2019-09-24", "2019-09-24", "2020-09-11"]})
I want to add a boolean True/False in df1 to find when df1.Id == df2.Id and df1.sendDate == df2.actDate.
Expected output would add a column to df1:
df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 2],
                    "sendDate": ["2019-09-24", "2020-09-11", "2018-01-06", "2018-01-06", "2019-09-24"],
                    "Match?": [True, False, False, False, True]})
I'm new to python from R, so please let me know what other info you may need.
Use isin and boolean indexing
import pandas as pd
df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 2],
                    "sendDate": ["2019-09-24", "2020-09-11",
                                 "2018-01-06", "2018-01-06",
                                 "2019-09-24"]})
df2 = pd.DataFrame({"Id": [1, 2, 2],
                    "actDate": ["2019-09-24", "2019-09-24", "2020-09-11"]})
df1['Match'] = (df1['Id'].isin(df2['Id'])) & (df1['sendDate'].isin(df2['actDate']))
print(df1)
Output:
   Id    sendDate  Match
0   1  2019-09-24   True
1   1  2020-09-11   True
2   2  2018-01-06  False
3   3  2018-01-06  False
4   2  2019-09-24   True
The .isin() approaches will find values where the ID and date entries don't necessarily appear together (e.g. Id=1 and date=2020-09-11 in your example). You can check for both by doing a .merge() and checking when df2's date field is not null:
df1['match'] = df1.merge(df2, how='left', left_on=['Id', 'sendDate'], right_on=['Id', 'actDate'])['actDate'].notnull()
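Applied to the sample frames above, this should reproduce the expected flags, since how='left' preserves df1's row order:
   Id    sendDate  match
0   1  2019-09-24   True
1   1  2020-09-11  False
2   2  2018-01-06  False
3   3  2018-01-06  False
4   2  2019-09-24   True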
A vectorized approach via numpy -
import numpy as np
df1['Match'] = np.where((df1['Id'].isin(df2['Id'])) & (df1['sendDate'].isin(df2['actDate'])),True,False)
You can use .isin():
df1['id_bool'] = df1.Id.isin(df2.Id)
df1['date_bool'] = df1.sendDate.isin(df2.actDate)
Check out the documentation here.
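If you want a single flag like in the expected output, the two columns can then be combined (with the caveat noted above that .isin() checks each column independently, not the (Id, date) pair):
df1['Match'] = df1['id_bool'] & df1['date_bool']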

creating a column based on missing value in pandas

I have a data frame for which I want to create a column that represents the missing-value pattern of each row. For example, given the CSV file
A,B,C,D
1,NaN,NaN,NaN
NaN,2,3,NaN
3,2,2,3
3,2,NaN,3
3,2,1,NaN
I want to create a column E whose value is determined as follows:
if A, B, C and D are all missing, E = 4;
if A, B, C and D are all present, E = 0;
if only A and B are missing, E = 1; and so on. The exact encoding of E doesn't have to be the one I described, just some indication of the pattern. How can I approach this problem in pandas?
Use isnull() in combination with sum(axis=1).
Example:
import pandas as pd
df = pd.DataFrame({'A': [1, None, 3, 3, 3],
                   'B': [None, None, 1, 1, 1]})
df['C'] = df.isnull().sum(axis=1)
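If you need the pattern itself rather than just the count, one option is to join the per-column missing flags into a string (a sketch building on the same isnull() mask; the column name 'pattern' is just illustrative):
import pandas as pd
df = pd.DataFrame({'A': [1, None, 3, 3, 3],
                   'B': [None, None, 1, 1, 1]})
mask = df.isnull()
df['C'] = mask.sum(axis=1)  # count of missing values per row, as above
# '01' means only B is missing, '11' means both are missing, etc.
df['pattern'] = mask.astype(int).astype(str).apply(''.join, axis=1)
print(df)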

Python Pandas DataFrame: how to locate an element by positional (int) in index *and* by label (str) in columns?

Just saw this:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
Apparently the .ix() operator is now deprecated. Wondering how to do something like this:
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=pd.DatetimeIndex(['2017-01-01', '2017-01-03', '2017-01-05']))
wanted_int_index = df.index.get_loc('2017-01-04', method='ffill')  # index position = 1
wanted_str_column = 'a'
value = df.ix[wanted_int_index, wanted_str_column] # value = 2
print(value)
# 2
My understanding is that .loc[] accepts labels (str) for both index and columns, while .iloc[] accepts positions (int) for both index and columns. Am I missing a usage here?
loc should work for non-datetime indexing.
import pandas as pd
import numpy as np
data = np.random.rand(10)
df = pd.DataFrame(data, index=range(10), columns=['A'])
print(df.loc[1,'A']) #this works
For a DatetimeIndex, like you have, you need to index using the datetime labels, i.e.
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]},
                  index=pd.DatetimeIndex(['2017-01-01', '2017-01-03', '2017-01-05']))
wanted_int_index = df.index.get_loc('2017-01-04', method='ffill')  # index position = 1
wanted_str_column = 'a'
value = df.loc[df.index[wanted_int_index], wanted_str_column] # value = 2
print(value) #this works
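An equivalent lookup that keeps the positional row index is to convert the column label to a position as well and use .iloc for both axes (just a sketch of the same access the other way around):
value = df.iloc[wanted_int_index, df.columns.get_loc(wanted_str_column)]  # also 2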

Detect Missing Column Labels in Pandas

I'm working with the dataset outlined here:
https://archive.ics.uci.edu/ml/datasets/Balance+Scale
I'm trying to create a general function to be able to parse any categorical data following these two rules:
Must have a column labeled class containing the class of the object
Each row must have the same numbers of columns
Minimal example of the data that I'm working with:
Class,LW,LD,RW,RD
B,1,1,1,1
L,1,2,1,1
R,1,2,1,3
R,2,2,4,5
This provides 3 unique classes: B, L, R. It also provides 4 features which pertain to each entry: LW, LD, RW and RD.
The following is a part of my function to handle generic cases, but my issue with it is that I don't know how to check if any column labels are simply missing:
import pandas as pd
import sys
dataframe = pd.read_csv('Balance_Data.csv')
columns = list(dataframe.columns.values)
if "Class" not in columns:
sys.exit("'Class' is not a column in the data")
if "Class.1" in columns:
sys.exit("Cannot specify more than one 'Class' column")
columns.remove("Class")
inputX = dataframe.loc[:, columns].as_matrix()
inputY = dataframe.loc[:, ['Class']].as_matrix()
At this point, the correct values are:
inputX = array([[1, 1, 1, 1],
                [1, 2, 1, 1],
                [1, 2, 1, 3],
                [2, 2, 4, 5]])
inputY = array([['B'],
                ['L'],
                ['R'],
                ['R'],
                ['R'],
                ['R']], dtype=object)
But if I remove the last column label (RD) and reprocess,
Class,LW,LD,RW
B,1,1,1,1
L,1,2,1,1
R,1,2,1,3
R,2,2,4,5
I get:
inputX = array([[1, 1, 1],
                [2, 1, 1],
                [2, 1, 3],
                [2, 4, 5]])
inputY = array([[1],
                [1],
                [1],
                [2]])
This indicates that pandas reads the label values from right to left instead of left to right, which means that if data with the wrong number of labels is fed into this function, it won't work correctly.
How can I check that the length of each row matches the number of column labels? (It can be assumed that there are no gaps in the data itself, i.e. every data row always has the same number of elements.)
I would pull it out as follows:
In [11]: df = pd.read_csv('Balance_Data.csv', index_col=0)
In [12]: df
Out[12]:
       LW  LD  RW  RD
Class
B       1   1   1   1
L       1   2   1   1
R       1   2   1   3
R       2   2   4   5
That way the assertion check can be:
if "Class" in df.columns:
sys.exit("class must be the first and only the column and number of columns must match all rows")
and then check that there are no NaNs in the last column:
In [21]: df.iloc[:, -1].notnull().all()
Out[21]: True
Note: this happens e.g. with the following (bad) csv:
In [31]: !cat bad.csv
A,B,C
1,2
3,4
In [32]: df = pd.read_csv('bad.csv', index_col=0)
In [33]: df
Out[33]:
   B   C
A
1  2 NaN
3  4 NaN
In [34]: df.iloc[:, -1].notnull().all()
Out[34]: False
I think these are the only two failing cases (but I think the error messages can be made clearer)...
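Putting both checks together, a minimal sketch of a loader (the function name and error messages are illustrative, not from the original code):
import sys
import pandas as pd

def load_labeled_csv(path):
    df = pd.read_csv(path, index_col=0)
    # with index_col=0, 'Class' should have been consumed as the index
    if df.index.name != "Class":
        sys.exit("'Class' is not the first column in the data")
    # if 'Class' is still among the columns, the header is malformed (e.g. duplicated)
    if "Class" in df.columns:
        sys.exit("Cannot specify more than one 'Class' column")
    # rows with fewer values than column labels leave NaNs in the last column (the bad.csv case above)
    if not df.iloc[:, -1].notnull().all():
        sys.exit("Number of values per row does not match the number of column labels")
    return df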
