pandas.DataFrame's binary operators permute index order - python

The binary operators in pandas.DataFrame, such as the gt (greater than) operator, seem to permute the DataFrame's index/columns in undesirable ways. The issue seems to be that they union mismatched indices together; the documentation says:
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
and that pd.Index.union (which I assume is called internally by the binary operators) sorts the indices by default. This behavior is undesirable for many (most?) applications of the binary operators, and there does not appear to be a clean way to override it.
Example
Here is an example to demonstrate. Setting up the data:
import pandas as pd
df = pd.DataFrame({'volume': [1, 2, 3],
'cost': [250, 150, 100],
'revenue': [100, 250, 300]},
index=['A', 'B', 'C'])
thresh = pd.Series({'volume': 2, 'cost': 125, 'revenue': 200})
and printing it:
print(df)
volume cost revenue
A 1 250 100
B 2 150 250
C 3 100 300
print(thresh)
volume 2
cost 125
revenue 200
dtype: int64
Since the indices of the DataFrame and Series match exactly (including order), the binary operator preserves order:
print(df.gt(thresh, axis=1)) # preserves column order
volume cost revenue
A False True False
B False True True
C True False True
What happens when we apply the binary operator with a Series whose indices (i) do not match the DataFrame order, and (ii) are not sorted?
thresh_perm = thresh.loc[['revenue', 'volume', 'cost']]
print(df.gt(thresh_perm, axis=1))
cost revenue volume
A True False False
B True True True
C False True True
The output column order matches neither the input DataFrame column order nor the Series index order. Instead it is sorted by string name.
Conclusion
Is my analysis correct (i.e., that pandas binary operators always sort the indices)? What is the best way to fix this? Ideally the binary operators would have an option that controls output order, such as:
df.gt(thresh_perm, axis=1, sort=False) # desired call; code does not work
Since pandas.Index.union has a sort option I think it makes sense for binary operators to be able to pass the option through.
For now, the best option seems to be to manually restore the column order:
print(df.gt(thresh_perm, axis=1).loc[:, df.columns])
volume cost revenue
A False True False
B False True True
C True False True
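Another possible workaround, sketched here under the assumption that the Series holds exactly the same labels as the DataFrame's columns, is to reorder the Series before the comparison rather than the result afterwards. The name aligned is introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'volume': [1, 2, 3],
                   'cost': [250, 150, 100],
                   'revenue': [100, 250, 300]},
                  index=['A', 'B', 'C'])
thresh_perm = pd.Series({'revenue': 200, 'volume': 2, 'cost': 125})

# Reorder the Series to follow the DataFrame's column order first, so both
# sides already share an identical index when the operator aligns them.
aligned = thresh_perm.reindex(df.columns)
result = df.gt(aligned, axis=1)
print(result)
```

If the Series is missing a label, reindex fills it with NaN, and every comparison against NaN evaluates to False, so this only suits cases where the labels really do match.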

Related

Removing columns if values are not in an ascending order python

Given a data like so:
Symbol
One
Two
1
28.75
25.10
2
29.00
25.15
3
29.10
25.00
I want to drop any column whose values are not in ascending order across all rows (though I want to allow for gaps). In this case, I want to drop column 'Two'. I tried the following code with no luck:
df.drop(df.columns[df.all(x <= y for x,y in zip(df, df[1:]))])
Thanks
Dropping those columns that give at least one (any) negative value (lt(0)) when their values are differenced by 1 lag (diff(1)) after NaNs are neglected (dropna):
columns_to_drop = [col for col in df.columns if df[col].diff(1).dropna().lt(0).any()]
df.drop(columns=columns_to_drop)
Symbol One
0 1 28.75
1 2 29.00
2 3 29.10
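A similar check can be sketched with Series.is_monotonic_increasing, which reports whether values never decrease. The sample data below is made up for illustration (a gap is added to column 'One'), and keep is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Symbol': [1, 2, 3],
                   'One': [28.75, np.nan, 29.10],   # gap in the middle
                   'Two': [25.10, 25.15, 25.00]})   # last value decreases

# Keep a column only if its non-missing values never decrease.
keep = [col for col in df.columns
        if df[col].dropna().is_monotonic_increasing]
result = df[keep]
print(result)
```

Note that is_monotonic_increasing also accepts repeated equal values; a strictly ascending check would need an additional test such as comparing the differences against zero.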
An expression that works with gaps (NaN):
A.loc[:, ~(A.iloc[1:, :].reset_index(drop=True) < A.iloc[:-1, :].reset_index(drop=True)).any()]
Without gaps it is equivalent to
A.loc[:, (A.iloc[1:, :].reset_index(drop=True) >= A.iloc[:-1, :].reset_index(drop=True)).all()]
This avoids loops, to take better advantage of the framework for bigger dataframes.
A.iloc[1:, :] returns the dataframe without the first row.
A.iloc[:-1, :] returns the dataframe without the last row.
Slices of a dataframe keep the indices of their rows, so the two slices have different indices. reset_index(drop=True) replaces them with a fresh index counting [0, 1, ...], which makes the two sides of the inequality compatible; drop=True discards the previous index instead of adding it as a column.
any (implicitly with axis=0) checks, for every column, whether any value is True; if so, some number was followed by a smaller one.
A.loc[:, mask] selects the columns where mask is True and drops the columns where mask is False.
The logic can be read as "not any value smaller than its predecessor" or "all values greater than or equal to their predecessor".
Check out the code below; the only logic is:
map(lambda i: list(df[i]) == sorted(list(df[i])), df.columns)
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
'Symbol': [1, 2, 3],
'One': [28.75, 29.00, 29.10],
'Two': [25.10, 25.15, 25.10],
}
)
# list(...) materializes the map object, which is lazy in Python 3
print(df.loc[:, list(map(lambda i: list(df[i]) == sorted(list(df[i])), df.columns))])

Count occurrences of True/False in column of dataframe

Is there a way to count the number of occurrences of boolean values in a column without having to loop through the DataFrame?
Doing something like
df[df["boolean_column"]==False]["boolean_column"].sum()
Will not work because False has a value of 0, hence a sum of zeroes will always return 0.
Obviously you could count the occurrences by looping over the column and checking, but I wanted to know if there's a pythonic way of doing this.
Use pd.Series.value_counts():
>> df = pd.DataFrame({'boolean_column': [True, False, True, False, True]})
>> df['boolean_column'].value_counts()
True 3
False 2
Name: boolean_column, dtype: int64
If you want to count False and True separately you can use pd.Series.sum() + ~:
>> df['boolean_column'].values.sum() # True
3
>> (~df['boolean_column']).values.sum() # False
2
With Pandas, the natural way is using value_counts:
df = pd.DataFrame({'A': [True, False, True, False, True]})
print(df['A'].value_counts())
# True 3
# False 2
# Name: A, dtype: int64
To calculate True or False values separately, don't compare against True / False explicitly, just sum and take the reverse Boolean via ~ to count False values:
print(df['A'].sum()) # 3
print((~df['A']).sum()) # 2
This works because bool is a subclass of int, and the behaviour also holds true for Pandas series / NumPy arrays.
Alternatively, you can calculate counts using NumPy:
print(np.unique(df['A'], return_counts=True))
# (array([False, True], dtype=bool), array([2, 3], dtype=int64))
I couldn't find here what I exactly need. I needed the number of True and False occurrences for further calculations, so I used:
true_count = (df['column']).value_counts()[True]
False_count = (df['column']).value_counts()[False]
Where df is your DataFrame and column is the column with booleans.
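One caveat with indexing value_counts() directly is that a label which never occurs is simply absent from the result, so the lookup raises a KeyError. A minimal sketch of one way around that, assuming a column that happens to contain no False values, is to reindex the counts with a fill value:

```python
import pandas as pd

df = pd.DataFrame({'column': [True, True, True]})  # no False values at all

# Force both labels to be present; a label that never occurs counts as 0.
counts = df['column'].value_counts().reindex([True, False], fill_value=0)
true_count = int(counts[True])
false_count = int(counts[False])
print(true_count, false_count)
```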
This alternative works for multiple columns and/or rows as well. 
df[df==True].count(axis=0)
Will get you the total amount of True values per column. For row-wise count, set axis=1. 
df[df==True].count().sum()
Adding a sum() in the end will get you the total amount in the entire DataFrame.
You could simply sum:
sum(df["boolean_column"])
This will find the number of "True" elements.
len(df["boolean_column"]) - sum(df["boolean_column"])
Will yield the number of "False" elements.
df.isnull()
returns a Boolean DataFrame in which True indicates a missing value.
df.isnull().sum()
returns the column-wise sum of True values.
df.isnull().sum().sum()
returns the total number of NA elements.
If you have a column of Boolean values in a DataFrame or, more interestingly, you do not have one but want to count the values in a column that satisfy a certain condition, you can try something like this (as an example I used <=):
(df['col']<=value).value_counts()
The parentheses wrap the comparison, which yields a Boolean Series; value_counts() then returns a Series of counts indexed by False and True. You can use it in other calculations without creating an additional variable by indexing with the label itself, [False] for the False count and [True] for the True count (positional [0] and [1] are not reliable, because value_counts() sorts by frequency):
(df['col']<=value).value_counts()[False] #for falses
(df['col']<=value).value_counts()[True] #for trues
Here is an attempt to be as literal and brief as possible in providing an answer. The value_counts() strategies are probably more flexible in the end. Summing (sum) and counting (count) are different operations, each expressing a different analytical intent, and sum depends on the type of the data.
"Count occurrences of True/False in column of dataframe"
import pandas as pd
df = pd.DataFrame({'boolean_column': [True, False, True, False, True]})
df[df==True].count()
#boolean_column 3
#dtype: int64
df[df!=False].count()
#boolean_column 3
#dtype: int64
df[df==False].count()
#boolean_column 2
#dtype: int64

Splitting a dataframe based on condition

I am trying to split my dataframe into two based on medical_plan_id. If it is empty, the row goes into df1; if it is not empty, into df2.
df1 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] == ""]
df2 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] is not ""]
The code below works, but if there are no empty fields, my code raises TypeError("invalid type comparison").
df1 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] == ""]
How can I handle this situation?
My df_with_medicalplanid looks like below:
wellthie_issuer_identifier ... medical_plan_id
0 UHC99806 ... None
1 UHC99806 ... None
Use ==, not is, to test equality
Likewise, use != instead of is not for inequality.
is has a special meaning in Python. It returns True if two variables point to the same object, while == checks if the objects referred to by the variables are equal. See also Is there a difference between == and is in Python?.
Don't repeat mask calculations
The Boolean masks you are creating are the most expensive part of your logic. It's also logic you want to avoid repeating manually as your first and second masks are inverses of each other. You can therefore use the bitwise inverse ~ ("tilde"), also accessible via operator.invert, to negate an existing mask.
Empty strings are different to null values
Equality versus empty strings can be tested via == '', but equality versus null values requires a specialized method: pd.Series.isnull. This is because null values are represented in NumPy arrays, which are used by Pandas, by np.nan, and np.nan != np.nan by design.
If you want to replace empty strings with null values, you can do so:
df['medical_plan_id'] = df['medical_plan_id'].replace('', np.nan)
Conceptually, it makes sense for missing values to be null (np.nan) rather than empty strings. But the opposite of the above process, i.e. converting null values to empty strings, is also possible:
df['medical_plan_id'] = df['medical_plan_id'].fillna('')
If the difference matters, you need to know your data and apply the appropriate logic.
Semi-final solution
Assuming you do indeed have null values, calculate a single Boolean mask and its inverse:
mask = df['medical_plan_id'].isnull()
df1 = df[mask]
df2 = df[~mask]
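If the column may contain both empty strings and null values, a single mask can cover both cases. The following is only a sketch with made-up sample data; it assumes that None, NaN and '' should all count as "empty".

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'medical_plan_id': [None, '', 'A12', np.nan, 'B34'],
                   'values': [1, 2, 3, 4, 5]})

# Turn nulls into empty strings, then test once for emptiness.
mask = df['medical_plan_id'].fillna('').eq('')
df1 = df[mask]    # rows with an empty or missing id
df2 = df[~mask]   # rows with a real id
print(df1, df2, sep='\n' * 2)
```

Because the mask is computed on a filled copy, the original column is left untouched, and either frame may legitimately come back empty.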
Final solution: avoid extra variables
Creating additional variables is something, as a programmer, you should look to avoid. In this case, there's no need to create two new variables, you can use GroupBy with dict to give a dictionary of dataframes with False (== 0) and True (== 1) keys corresponding to your masks:
dfs = dict(tuple(df.groupby(df['medical_plan_id'].isnull())))
Then dfs[0] represents df2 and dfs[1] represents df1 (see also this related answer). As a variant of the above, you can forgo dictionary construction and use Pandas GroupBy methods:
dfs = df.groupby(df['medical_plan_id'].isnull())
dfs.get_group(0) # equivalent to dfs[0] from dict solution
dfs.get_group(1) # equivalent to dfs[1] from dict solution
Example
Putting all the above in action:
df = pd.DataFrame({'medical_plan_id': [np.nan, '', 2134, 4325, 6543, '', np.nan],
'values': [1, 2, 3, 4, 5, 6, 7]})
df['medical_plan_id'] = df['medical_plan_id'].replace('', np.nan)
dfs = dict(tuple(df.groupby(df['medical_plan_id'].isnull())))
print(dfs[0], dfs[1], sep='\n'*2)
medical_plan_id values
2 2134.0 3
3 4325.0 4
4 6543.0 5
medical_plan_id values
0 NaN 1
1 NaN 2
5 NaN 6
6 NaN 7
Another variant is to unpack df.groupby, which returns an iterator with tuples (first item being the element of groupby and the second being the dataframe).
Like this for instance:
cond = df_with_medicalplanid['medical_plan_id'] == ''
(_, df1) , (_, df2) = df_with_medicalplanid.groupby(cond)
In Python, _ is conventionally used for values you do not intend to keep. I have split the code over two lines for readability.
Full example
import pandas as pd
df_with_medicalplanid = pd.DataFrame({
'medical_plan_id': ['214212','','12251','12421',''],
'value': 1
})
cond = df_with_medicalplanid['medical_plan_id'] == ''
(_, df1) , (_, df2) = df_with_medicalplanid.groupby(cond)
print(df1)
Returns:
medical_plan_id value
0 214212 1
2 12251 1
3 12421 1

pandas lookup with long and nested conditions

I want to perform a lookup in a dataframe via pandas. But it would be built from a series of nested if/else statements, similar to what is outlined in Pandas dataframe add a field based on multiple if statements
But I want to use up to 13 different variables, which seems to result in chaos pretty soon. Is there some notation or other nice feature which allows me to specify such long and nested conditions in pandas?
So far np.where() http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html might be my best option.
Is there a shortcut if I only match for equality in all conditions?
Am I forced to write out each conditional filter? Or could I have a single expression that selects the (single) lookup value?
Edit Ideally I would not want to match
df.loc[df['column_name'] == some_value]
for each value, i.e. 13 * the number of categorical levels (let's assume 7) would be a lot of different values; especially if combinations of conditions such as df.loc[df['fist'] == some_value][df['second'] == otherValue1] occur, i.e. they are all chained.
edit
A minimal example
df = pd.DataFrame({'ageGroup': [1, 2, 2, 1],
'first2DigitsOfPostcode': ['12', '23', '12', '12'],
'valueOfProduct': ['low', 'medum', 'high', 'low'],
'lookup_join_value': ['foo', 'bar', 'foo', 'baz']})
defines the lookup table, which was generated by a SQL query grouping by all the columns and aggregating the values (so, due to the Cartesian product, all value combinations should be represented in the lookup table).
A new record could look like
new_values = pd.DataFrame({'ageGroup': [1],
'first2DigitsOfPostcode': ['12'],
'valueOfProduct': ['low']})
How can I automate the lookup across all the conditions, assuming all conditions require a match by equality (if this makes it easier)?
I found
pd.lookup Vectorized look-up of values in Pandas dataframe, which seems to work for a single column / condition
Maybe a merge could be a solution? Python Pandas: DataFrame as a Lookup Table, but that does not really produce the desired lookup result.
edit 2
The second answer seems to be pretty interesting. But
mask = df.drop('lookup_join_value', axis=1).isin(new_values)
print(mask)
print(df[mask])
print(df[mask]['lookup_join_value'])
will unfortunately just return NaN for the lookup value.
Now that I better know what you're after, a dataframe merge is likely a much better choice:
IN: df.merge(new_values, how='inner')
OUT: ageGroup first2DigitsOfPostcode lookup_join_value valueOfProduct
0 1 12 foo low
1 1 12 baz low
Certainly shorter than the other answer I gave! I'll leave the old one though in case it inspires someone else.
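The merge idea can also be sketched from the side of the new records, so that every new record keeps its row and picks up the lookup value. This is an illustration using the question's sample data; the keys list is an assumption about which columns form the lookup key.

```python
import pandas as pd

df = pd.DataFrame({'ageGroup': [1, 2, 2, 1],
                   'first2DigitsOfPostcode': ['12', '23', '12', '12'],
                   'valueOfProduct': ['low', 'medum', 'high', 'low'],
                   'lookup_join_value': ['foo', 'bar', 'foo', 'baz']})
new_values = pd.DataFrame({'ageGroup': [1],
                           'first2DigitsOfPostcode': ['12'],
                           'valueOfProduct': ['low']})

# Assumed key columns: everything except the lookup value itself.
keys = ['ageGroup', 'first2DigitsOfPostcode', 'valueOfProduct']

# A left join keeps every new record; unmatched records would get NaN.
result = new_values.merge(df, on=keys, how='left')
print(result)
```

In this sample the key (1, '12', 'low') appears twice in the lookup table, so the single new record yields two rows; a real lookup table with unique keys would return exactly one row per record.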
I think df.isin() is along the lines of what you're looking for.
Using your example df, and these two:
exists = pd.DataFrame({'ageGroup': [1],
'first2DigitsOfPostcode': ['12'],
'valueOfProduct' : 'low'})
new = pd.DataFrame({'ageGroup': [1],
'first2DigitsOfPostcode': ['12'],
'valueOfProduct' : 'high'})
Then you can check to see what values match, if all, or just some:
df.isin(exists.values[0])
Out[46]: ageGroup first2DigitsOfPostcode valueOfProduct
0 True True True
1 False False False
2 False True False
3 True True True
df.isin(new.values[0])
Out[46]: ageGroup first2DigitsOfPostcode valueOfProduct
0 True True False
1 False False False
2 False True True
3 True True False
If your "query" wasn't a dataframe but instead a list it wouldn't need the ".values[0]" bit. The problem with a dictionary is it tries to match the index as well.
It's not clear to me from your question exactly what you want returned, but you could then subset based on whether all (or some) of the rows are the same:
# Returns matching rows
df[df.isin(exists.values[0]).values.all(axis=1)]
# Returns rows where the first two columns match
matches = df.isin(new.values[0]).values
df[[item==[True,True,False] for item in matches.tolist()]]
...There might be a smarter way to write that last one.
df.query is an option if you can write the query as an expression using the column names.
So you can do:
query_string = 'some long (but valid) boolean query'
example from pandas:
>>> from numpy.random import randn
>>> from pandas import DataFrame
>>> df = DataFrame(randn(10, 2), columns=list('ab'))
>>> df.query('a > b')
# similar to this
>>> df[df.a > df.b]
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html
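For the equality-only case from the question, the query string could be assembled from a dictionary instead of being written out by hand. This is a rough sketch; the conditions dictionary is hypothetical, and it assumes the column names are valid Python identifiers.

```python
import pandas as pd

df = pd.DataFrame({'ageGroup': [1, 2, 2, 1],
                   'first2DigitsOfPostcode': ['12', '23', '12', '12'],
                   'valueOfProduct': ['low', 'medum', 'high', 'low'],
                   'lookup_join_value': ['foo', 'bar', 'foo', 'baz']})

# Hypothetical set of equality conditions, one entry per column.
conditions = {'ageGroup': 1,
              'first2DigitsOfPostcode': '12',
              'valueOfProduct': 'low'}

# repr() quotes strings and leaves numbers bare, giving e.g.
# "ageGroup == 1 and first2DigitsOfPostcode == '12' and ..."
query_string = ' and '.join(f'{col} == {val!r}'
                            for col, val in conditions.items())
matches = df.query(query_string)
print(matches)
```

Building the string this way is only reasonable for trusted values, since the text is evaluated as an expression.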

Fail to filter pandas dataframe by categorical column

pandas 0.16.1
I converted all columns in the dataframe to categoricals so it takes MUCH less space when dumped to disk. Now I want to filter the dataframe. This works with == and .isin, but fails on <, <=, etc. with "Unordered Categoricals can only compare equality or not"
data[data["MONTH COLUMN"]<=3]
If I comment out the following lines in categorical.py everything works fine. Is it a bug in pandas?
if not self.ordered:
if op in ['__lt__', '__gt__','__le__','__ge__']:
raise TypeError("Unordered Categoricals can only compare equality or not")
(I think it was a good idea to use the Categorical datatype on a column which has only 12 unique values in ~1'400'000 rows.)
The documentation states:
Note: New categorical data are NOT automatically ordered. You must explicitly pass ordered=True to indicate an ordered Categorical.
When you first create a category you want to be ordered, just specify this:
In [1]: import pandas as pd
In [3]: s = pd.Series(["a","b","c","a"]).astype('category', ordered=True)
In [5]: s
Out[5]:
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): [a < b < c]
In [4]: s > 'a'
Out[4]:
0 False
1 True
2 True
3 False
dtype: bool
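The astype('category', ordered=True) call above matches the pandas version named in the question. In later pandas releases the ordering is, as far as I know, passed through a CategoricalDtype instead. A sketch for the month column, using made-up data and a hypothetical month_dtype name:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Ordered dtype covering months 1..12 (assumed category set).
month_dtype = CategoricalDtype(categories=list(range(1, 13)), ordered=True)

data = pd.DataFrame({"MONTH COLUMN": [1, 3, 12, 2, 3]})
data["MONTH COLUMN"] = data["MONTH COLUMN"].astype(month_dtype)

# Ordering comparisons are allowed once the categorical is ordered.
filtered = data[data["MONTH COLUMN"] <= 3]
print(filtered)
```

The scalar on the right-hand side has to be one of the categories, which is why the full range of months is declared up front.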
