How can I perform comparisons between DataFrames and Series? I'd like to mask elements in a DataFrame/Series that are greater/less than elements in another DataFrame/Series.
For instance, the following doesn't replace elements greater than the mean
with nans although I was expecting it to:
>>> x = pd.DataFrame(data={'a': [1, 2], 'b': [3, 4]})
>>> x[x > x.mean(axis=1)] = np.nan
>>> x
a b
0 1 3
1 2 4
If we look at the boolean array created by the comparison, it is really weird:
>>> x = pd.DataFrame(data={'a': [1, 2], 'b': [3, 4]})
>>> x > x.mean(axis=1)
a b 0 1
0 False False False False
1 False False False False
I don't understand by what logic the resulting boolean array is like that. I'm able to work around this problem by using transpose:
>>> (x.T > x.mean(axis=1).T).T
a b
0 False True
1 False True
But I believe there is some "correct" way of doing this that I'm not aware of. And at least I'd like to understand what is going on.
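The problem is that a comparison between a DataFrame and a Series aligns the Series index with the DataFrame's columns, so the row labels 0 and 1 are treated as column names. If you use .gt and pass axis=0, the Series is aligned with the row index instead and you get the result you want: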
In [203]:
x.gt(x.mean(axis=1), axis=0)
Out[203]:
a b
0 False True
1 False True
You can see what I mean when you perform the comparison with the np array:
In [205]:
x > x.mean(axis=1).values
Out[205]:
a b
0 False False
1 False True
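Here you can see that, without index alignment, the values are broadcast along the columns by default (the first mean is compared with column a, the second with column b), which gives a different result.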
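As a minimal sketch of the axis=0 approach applied to the original masking goal (the names row_means, above and masked are only illustrative, and DataFrame.mask is one of several ways to do the replacement):

```python
import pandas as pd

x = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# One mean per row; the resulting Series is indexed by the row labels 0 and 1.
row_means = x.mean(axis=1)

# axis=0 aligns the Series with the row index instead of the column labels.
above = x.gt(row_means, axis=0)

# mask() replaces the entries where the condition is True with NaN.
masked = x.mask(above)
print(masked)
```

Using mask returns a new DataFrame and leaves x untouched, which tends to be easier to reason about than assigning through a boolean indexer.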
My requirement: I have a large DataFrame with millions of rows. I encoded all strings as numeric values so that I could use NumPy's vectorization to increase processing speed.
So I was looking for a way to quickly check whether a number exists in another column that holds lists. Previously I used a list comprehension with string values, but after converting to np.arrays I was looking for a similar function.
I stumbled across this link: check if values of a column are in values of another numpy array column in pandas
In order to use numpy.isin, I tried running the code below:
dt = pd.DataFrame({'id' : ['a', 'a', 'a', 'b', 'b'],
'col_a': [1,2,5,1,2],
'col_b': [2,2,[2,5,4],4,[1,5,6,3,2]]})
dt
id col_a col_b
0 a 1 2
1 a 2 2
2 a 5 [2, 5, 4]
3 b 1 4
4 b 2 [1, 5, 6, 3, 2]
When I enter:
np.isin(dt['col_a'], dt['col_b'])
The output is:
array([False, True, False, False, True])
Which is incorrect as the 3rd row has 5 in both columns col_a and col_b.
Whereas if I change the value to 4 as below:
dt = pd.DataFrame({'id' : ['a', 'a', 'a', 'b', 'b'],
'col_a': [1,2,4,1,2],
'col_b': [2,2,[2,5,4],4,[1,5,6,3,2]]})
dt
id col_a col_b
0 a 1 2
1 a 2 2
2 a 4 [2, 5, 4]
3 b 1 4
4 b 2 [1, 5, 6, 3, 2]
and execute same code:
np.isin(dt['col_a'], dt['col_b'])
I get correct result:
array([False, True, True, False, True])
Can someone please let me know why it's giving different results.
Since col_b not only has lists but also integers, you may need to use apply and treat them differently:
dt.apply(lambda x: x['col_a'] in x['col_b'] if type(x['col_b']) is list
         else x['col_a'] == x['col_b'], axis=1)
Output:
0 False
1 True
2 True
3 False
4 True
dtype: bool
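For each element of dt['col_a'], np.isin checks whether it is present anywhere in the whole dt['col_b'] column, not just in the same row. Roughly:
[
1 in dt['col_b'].tolist(),
2 in dt['col_b'].tolist(),
5 in dt['col_b'].tolist(),
...
]
There is no top-level 5 in dt['col_b'] (it only appears inside lists), but there is a top-level 4.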
From the docs
isin is an element-wise function version of the python keyword in. isin(a, b) is roughly equivalent to np.array([item in b for item in a]) if a and b are 1-D sequences.
Also, your issue is that the dt['col_b'] column is inconsistent (some values are numbers, some are lists). I think the easiest approach is to use apply:
def isin(row):
if isinstance(row['col_b'], int):
return row['col_a'] == row['col_b']
else:
return row['col_a'] in row['col_b']
dt.apply(isin, axis=1)
Output:
0 False
1 True
2 True
3 False
4 True
dtype: bool
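To make the difference between the two kinds of check concrete, here is a small sketch using plain Python membership rather than np.isin itself. The helper row_contains and the variable names are hypothetical, and the list comprehension is just one alternative to apply:

```python
import pandas as pd

dt = pd.DataFrame({
    'id': ['a', 'a', 'a', 'b', 'b'],
    'col_a': [1, 2, 5, 1, 2],
    'col_b': [2, 2, [2, 5, 4], 4, [1, 5, 6, 3, 2]],
})

def row_contains(value, container):
    """Membership test for list-like cells, equality test for scalar cells."""
    if isinstance(container, (list, tuple, set)):
        return value in container
    return value == container

a_values = dt['col_a'].tolist()   # plain Python ints
b_values = dt['col_b'].tolist()   # mix of ints and lists

# Whole-column check: is each col_a value a top-level element of col_b?
whole_column = [a in b_values for a in a_values]

# Row-wise check: compare each col_a value only with the col_b cell in its own row.
per_row = pd.Series(
    [row_contains(a, b) for a, b in zip(a_values, b_values)],
    index=dt.index,
)
print(whole_column)
print(per_row.tolist())
```

The whole-column check misses the 5 in the third row because it only appears inside a list, while the row-wise check finds it.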
I have a dataframe that might contain NaN values.
array = np.empty((4,5))
array[:] = 10
df = pd.DataFrame(array)
df.iloc[1,3] = np.nan
df.isna().apply(lambda x: any(x), axis = 0)
Output:
0 False
1 False
2 False
3 True
4 False
dtype: bool
When I run:
any(df.isna())
It returns:
True
If there are no NaNs:
array = np.empty((4,5))
array[:] = 10
df = pd.DataFrame(array)
#df.iloc[1,3] = np.nan
df.isna().apply(lambda x: any(x), axis = 0)
0 False
1 False
2 False
3 False
4 False
dtype: bool
However when I run:
any(df.isna())
It returns:
True
Why is this the case? Have I misunderstood the function any()?
When you loop over a DataFrame you are actually iterating over its column labels, not its rows or values as you might think. More precisely, the for loop calls DataFrame.__iter__, which returns an iterator over the column labels of the DataFrame.
For instance, in the following
df = pd.DataFrame(columns=['a', 'b', 'c'])
for x in df:
print(x)
# Output:
#
# a
# b
# c
x holds the name of each df column. You can also check this by looking at the output of list(df).
This means that when you do any(df.isna()), under the hood any is actually iterating over the column labels of df and checking their truthiness. If at least one is truthy it returns True.
In both of your examples the column labels are the integers list(df.isna()) = list(df.columns) = [0, 1, 2, 3, 4], of which only 0 is falsy. Therefore, in both cases any(df.isna()) returns True.
Solution
The solution is to use DataFrame.any with axis=None instead of using the built-in any function.
df.isna().any(axis=None)
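A short sketch contrasting the built-in any with the pandas and NumPy reductions. The variable names are illustrative only, and np.full is used here instead of np.empty simply to build the same all-10 frame in one step:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.full((4, 5), 10.0))   # no NaN anywhere yet

# Iterating a DataFrame yields its column labels.
labels = list(df.isna())

# The built-in any() therefore tests the labels 0..4, not the cell values.
builtin_result = any(df.isna())

# Reducing over the actual values gives the expected answer.
has_nan_before = bool(df.isna().to_numpy().any())

df.iloc[1, 3] = np.nan
has_nan_after = bool(df.isna().any(axis=None))

print(labels, builtin_result, has_nan_before, has_nan_after)
```

Calling df.isna().any() without axis=None would instead return one boolean per column, which matches the per-column output shown in the question.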
I have a column in python pandas DataFrame that has boolean True/False values, but for further calculations I need 1/0 representation. Is there a quick pandas/numpy way to do that?
A succinct way to convert a single column of boolean values to a column of integers 1 or 0:
df["somecolumn"] = df["somecolumn"].astype(int)
Just multiply your DataFrame by 1 (int):
[1]: data = pd.DataFrame([[True, False, True], [False, False, True]])
[2]: print(data)
       0      1     2
0   True  False  True
1  False  False  True
[3]: print(data*1)
   0  1  2
0  1  0  1
1  0  0  1
True is 1 in Python, and likewise False is 0*:
>>> True == 1
True
>>> False == 0
True
You should be able to perform any operations you want on them by just treating them as though they were numbers, as they are numbers:
>>> issubclass(bool, int)
True
>>> True * 5
5
So to answer your question, no work necessary - you already have what you are looking for.
* Note I use is as an English word, not the Python keyword is - True will not be the same object as any random 1.
This question specifically mentions a single column, so the currently accepted answer works. However, it doesn't generalize to multiple columns. For those interested in a general solution, use the following:
df.replace({False: 0, True: 1}, inplace=True)
This works for a DataFrame that contains columns of many different types, regardless of how many are boolean.
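An alternative sketch for the multi-column case that only touches columns whose dtype is bool, using select_dtypes together with a per-column astype mapping. The frame and column names below are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'flag': [True, False, True],
    'name': ['x', 'y', 'z'],
    'score': [0.5, 1.5, 2.5],
    'ok': [False, False, True],
})

# Pick out only the columns whose dtype is bool.
bool_cols = df.select_dtypes(include='bool').columns

# astype accepts a {column: dtype} mapping and returns a new DataFrame.
converted = df.astype({col: 'int64' for col in bool_cols})

print(converted.dtypes)
```

Because astype returns a new object, the original df keeps its bool columns; reassign the result if the conversion should replace it.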
You also can do this directly on Frames
In [104]: df = pd.DataFrame(dict(A = True, B = False),index=range(3))
In [105]: df
Out[105]:
A B
0 True False
1 True False
2 True False
In [106]: df.dtypes
Out[106]:
A bool
B bool
dtype: object
In [107]: df.astype(int)
Out[107]:
A B
0 1 0
1 1 0
2 1 0
In [108]: df.astype(int).dtypes
Out[108]:
A int64
B int64
dtype: object
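Use Series.view to convert booleans to integers (note that Series.view is deprecated in recent pandas versions, where astype is the suggested replacement):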
df["somecolumn"] = df["somecolumn"].view('i1')
You can use a transformation for your data frame:
df = pd.DataFrame(my_data condition)
Then transform True/False into 1/0:
df = df*1
I had to map FAKE/REAL to 0/1 but couldn't find a proper answer.
Below is how to map a column named 'type', which has the values FAKE/REAL, to 0/1 (note: the same approach can be applied to any column name and values):
df.loc[df['type'] == 'FAKE', 'type'] = 0
df.loc[df['type'] == 'REAL', 'type'] = 1
This is a reproducible example based on some of the existing answers:
import pandas as pd
def bool_to_int(s: pd.Series) -> pd.Series:
"""Convert the boolean to binary representation, maintain NaN values."""
return s.replace({True: 1, False: 0})
# build a small example dataframe
df = pd.DataFrame({"a": range(10), "b": range(10, 0, -1)}).assign(
a_bool=lambda df: df["a"] > 5,
b_bool=lambda df: df["b"] % 2 == 0,
)
# select all bool columns (or specify which cols to use)
bool_cols = [c for c, d in df.dtypes.items() if d == "bool"]
# apply the new coding to a new dataframe (or can replace the existing one)
# bind c as a default argument so each lambda keeps its own column name
df_new = df.assign(**{c: lambda df, c=c: df[c].pipe(bool_to_int) for c in bool_cols})
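A side note on building callables in a comprehension like the one above: Python closures look up the loop variable when they are called, not when they are created, so binding it as a default argument is a common way to give each lambda its own column name. The sketch below uses a made-up two-column frame to show the difference:

```python
import pandas as pd

df = pd.DataFrame({'p': [True, False], 'q': [False, False]})
cols = ['p', 'q']

# Late binding: every lambda reads `c` when it is called, by which time
# the comprehension has finished and `c` is the last column name.
late = {c: (lambda frame: frame[c].astype('int64')) for c in cols}

# Early binding: the default argument captures the value of `c` per lambda.
early = {c: (lambda frame, c=c: frame[c].astype('int64')) for c in cols}

wrong = df.assign(**late)    # both columns are computed from 'q'
right = df.assign(**early)   # each column is computed from itself

print(wrong)
print(right)
```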
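Tried and tested (note that the keys here are the strings 'True' and 'False'; for a genuine bool column, use the keys True and False instead):
df[col] = df[col].map({'True': 1,'False' :0 })
If there is more than one column with True/False, use the following.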
for col in bool_cols:
df[col] = df[col].map({'True': 1,'False' :0 })
@AMC wrote this in a comment:
If the column is of the type object
df["somecolumn"] = df["somecolumn"].astype(bool).astype(int)
I have the following DataFrame:
df = pd.DataFrame([[np.nan, 1], [2, 3]], dtype='float64')
When I check the equality of the values with df == 1, I get the following DataFrame:
0 1
0 False True
1 False False
Which I consider a normal behaviour. However, if I choose 'Int64' (capital I, because 'int64' does not have NaNs) instead of 'float64':
df = pd.DataFrame([[np.nan, 1], [2, 3]], dtype='Int64')
Which, printed out, is:
0 1
0 <NA> 1
1 2 3
and I try the same comparison as before (df == 1), I get:
0 1
0 <NA> False
1 False False
First of all, I don't see why 1 == 1 would yield False (0, 1). Then, I don't see either why the comparison with <NA> does not yield False as it does with floats.
Is there another way of comparing than == which would make this work?
EDIT:
My pandas version is 1.0.4
I don't see either why the comparison with <NA> does not yield False as it does with floats. Is there another way of comparing than == which would make this work?
df = pd.DataFrame([[np.nan, 1], [2, 3]], dtype='Int64')
df.notna() & df.eq(1)
# 0 1
#0 False True
#1 False False
<NA> propagates in any binary operation (source). Please also note the following warning:
Experimental: the behaviour of NA can still change without warning.
See also the example "comparison" in the docs which corresponds to your example.
To be as concise as possible, I ended up using:
(df == 1) is True
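This is False if the comparison yields <NA>. Note that is tests object identity, so it only works on a single scalar comparison (for example (value == 1) is True); applied to a whole DataFrame, the expression is always False.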
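A minimal sketch of how <NA> propagates through == on a nullable Int64 Series, and two ways to obtain plain True/False values instead. The Series s and the result names are illustrative, and pd.NA is still documented as experimental, so behaviour may differ between pandas versions:

```python
import pandas as pd

s = pd.Series([pd.NA, 1, 2], dtype='Int64')

# Comparison propagates missing values: the result has dtype "boolean"
# and contains <NA> where the input was missing.
raw = s == 1

# Option 1: treat "unknown" as False explicitly.
strict = raw.fillna(False).astype(bool)

# Option 2: require the value to be present and equal.
present_and_equal = s.notna() & s.eq(1)

# For a single scalar, the comparison result is the pd.NA singleton itself.
scalar_is_na = (pd.NA == 1) is pd.NA

print(raw.tolist(), strict.tolist(), scalar_is_na)
```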
I'm trying to replace some empty lists in my data with NaN values. But how do I represent an empty list in the expression?
import numpy as np
import pandas as pd
d = pd.DataFrame({'x' : [[1,2,3], [1,2], ["text"], []], 'y' : [1,2,3,4]})
d
x y
0 [1, 2, 3] 1
1 [1, 2] 2
2 [text] 3
3 [] 4
d.loc[d['x'] == [],['x']] = d.loc[d['x'] == [],'x'].apply(lambda x: np.nan)
d
ValueError: Arrays were different lengths: 4 vs 0
Also, selecting [text] with d[d['x'] == ["text"]] fails with ValueError: Arrays were different lengths: 4 vs 1, whereas selecting 3 with d[d['y'] == 3] works. Why?
If you wish to replace empty lists in the column x with numpy nan's, you can do the following:
d.x = d.x.apply(lambda y: np.nan if len(y)==0 else y)
If you want to subset the dataframe on rows equal to ['text'], try the following:
d[[y==['text'] for y in d.x]]
I hope this helps.
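Both steps can be put together in one runnable sketch. Here the replacement is written to a copy (cleaned) and guarded with an isinstance check so that non-list cells are left alone; those two choices and the variable names are assumptions for illustration, not part of the original answer:

```python
import numpy as np
import pandas as pd

d = pd.DataFrame({'x': [[1, 2, 3], [1, 2], ['text'], []], 'y': [1, 2, 3, 4]})

# Replace empty lists with NaN, element by element, on a copy of the frame.
cleaned = d.copy()
cleaned['x'] = pd.Series(
    [np.nan if isinstance(v, list) and not v else v for v in d['x']],
    index=d.index,
    dtype=object,
)

# Select the rows whose cell is exactly the list ['text'].
text_rows = d[[v == ['text'] for v in d['x']]]

print(cleaned)
print(text_rows)
```

The comparison is done per element in Python because d['x'] == ['text'] would try to compare the whole column against a one-element list-like, which is what triggers the length error in the question.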
You can use the apply function to match a given cell value regardless of whether it is a string, a list, and so on.
For example, in your case:
import pandas as pd
d = pd.DataFrame({'x' : [[1,2,3], [1,2], ["text"], []], 'y' : [1,2,3,4]})
d
x y
0 [1, 2, 3] 1
1 [1, 2] 2
2 [text] 3
3 [] 4
If you use d == 3 to select the cells whose value is 3, it works fine:
x y
0 False False
1 False False
2 False True
3 False False
However, if you use == to match a list, the result may not be what you expect, for example with d == [text], d == ['text'] or d == '[text]'.
There are some solutions:
Use apply() on the specified Series in your DataFrame, just like the answer at the top.
A more general method is to use applymap() on the whole DataFrame, for example as a preprocessing step (in recent pandas versions applymap is deprecated in favour of DataFrame.map):
d.applymap(lambda x: x == [])
x y
0 False False
1 False False
2 False False
3 True False
I hope this helps you and later readers. It would be better to add a type check in your applymap function, which could otherwise raise exceptions.
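One way the suggested type check might look, as a sketch: the helper is_empty_list is hypothetical, and the element-wise pass is written with apply plus Series.map so that it does not depend on whether applymap or DataFrame.map is available in the installed pandas version:

```python
import pandas as pd

d = pd.DataFrame({'x': [[1, 2, 3], [1, 2], ['text'], []], 'y': [1, 2, 3, 4]})

def is_empty_list(value):
    """True only for an actual empty list; other types are never matched."""
    return isinstance(value, list) and len(value) == 0

# Apply the check to every cell, one column at a time.
empty_mask = d.apply(lambda col: col.map(is_empty_list))

print(empty_mask)
```

The isinstance guard means integer or string cells simply yield False rather than relying on how == behaves between a scalar and a list.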
To answer your main question, just leave out the empty lists altogether. If you use pandas.concat instead of building a DataFrame from a dictionary, NaNs are filled in automatically wherever one column has a value and the other does not.
>>> import pandas as pd
>>> ser1 = pd.Series([[1,2,3], [1,2], ["text"]], name='x')
>>> ser2 = pd.Series([1,2,3,4], name='y')
>>> result = pd.concat([ser1, ser2], axis=1)
>>> result
x y
0 [1, 2, 3] 1
1 [1, 2] 2
2 [text] 3
3 NaN 4
About your second question, it seems that you can't search inside of an element. Perhaps you should make that a separate question since it's not really related to your main question.