I have a dataframe of dates:
>>> d.head()
Out[55]:
0 2010-06-01
1 2010-06-02
2 2010-06-03
3 2010-06-04
4 2010-06-07
dtype: datetime64[ns]
I am not able to check whether a given date is contained in it:
>>> d.iloc[1]
Out[59]: Timestamp('2010-06-02 00:00:00')
>>> d.iloc[1] in d
Out[60]: False
>>> np.datetime64(d.iloc[1]) in d
Out[61]: False
>>> d.iloc[1] in pd.to_datetime(d)
Out[62]: False
>>> pd.to_datetime(d.iloc[1]) in pd.to_datetime(d)
Out[63]: False
What's the best way to check this?
To answer some of the comments below:
Using values doesn't solve it:
>>> d.iloc[1] in d.values
Out[69]: False
I don't think it is a matter of iloc returning a row rather than a value:
>>> x= pd.Timestamp('2010-6-2')
>>> x
Out[72]: Timestamp('2010-06-02 00:00:00')
>>> x in d
Out[73]: False
>>> x in pd.to_datetime(d)
Out[74]: False
>>> x in d.values
Out[75]: False
Try this. You are comparing the first value of a pd.Series against the values in the column, which of course should come back True.
The reason I believe your comparison does not work is that the in operator acting on a pd.Series checks for existence in the series index, not the series values themselves. Applying set ensures that the series values are used for the comparison.
# df
# date
# 0 2010-06-01
# 1 2010-06-02
# 2 2010-06-03
# 3 2010-06-04
# 4 2010-06-07
# convert date column to datetime
df.date = pd.to_datetime(df.date)
df.date[1] in set(df.date)
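A minimal, self-contained sketch of that index-vs-values behaviour (the series below is made up for illustration, not taken from the question):
import pandas as pd

s = pd.Series(pd.to_datetime(["2010-06-01", "2010-06-02"]))

print(1 in s)               # True: 1 is an index label
print(s.iloc[1] in s)       # False: the Timestamp is not an index label
print(s.iloc[1] in set(s))  # True: set() materialises the values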
Here's one possible answer I got by trial and error; not sure if I am missing something.
Checking d shows that it has dtype datetime64[ns]:
>>> d.head()
Out[55]:
0 2010-06-01
1 2010-06-02
2 2010-06-03
3 2010-06-04
4 2010-06-07
dtype: datetime64[ns]
The same dtype shows up for d.values:
>>> d.values
Out[76]:
array(['2010-05-31T20:00:00.000000000-0400', '2010-06-01T20:00:00.000000000-0400',.....], dtype='datetime64[ns]')
But accessing a single element returns a Timestamp:
>>> d.iloc[1]
Out[82]: Timestamp('2010-06-02 00:00:00')
So I did this, which worked:
>>> x= pd.Timestamp('2010-6-2')
>>> x
Out[72]: Timestamp('2010-06-02 00:00:00')
>>> np.datetime64(x) in d.values
Out[77]: True
Checking @jp_data_analysis's suggestion of using set also worked, as it keeps the elements as Timestamps:
>>> set(d.iloc[:])
Out[81]:
{Timestamp('2015-10-13 00:00:00'),
Timestamp('2011-07-18 00:00:00'),......
>>> x in set(d.iloc[:])
Out[83]: True
You can do the following, with .isin (note that .isin does require a list as input):
df.date = pd.to_datetime(df.date)
df.date.isin([df.date.iloc[1]])
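A hedged sketch of how that reads end-to-end (the 'date' column below is made up); chain .any() if you only want a single True/False:
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2010-06-01", "2010-06-02", "2010-06-03"])})

mask = df.date.isin([df.date.iloc[1]])  # element-wise membership test
print(mask.any())                       # True: the date occurs in the column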
I have a column in a Python pandas DataFrame that has boolean True/False values, but for further calculations I need a 1/0 representation. Is there a quick pandas/numpy way to do that?
A succinct way to convert a single column of boolean values to a column of integers 1 or 0:
df["somecolumn"] = df["somecolumn"].astype(int)
Just multiply your DataFrame by 1 (int):
In [1]: data = pd.DataFrame([[True, False, True], [False, False, True]])
In [2]: print(data)
       0      1     2
0   True  False  True
1  False  False  True
In [3]: print(data * 1)
   0  1  2
0  1  0  1
1  0  0  1
True is 1 in Python, and likewise False is 0*:
>>> True == 1
True
>>> False == 0
True
You should be able to perform any operations you want on them by just treating them as though they were numbers, as they are numbers:
>>> issubclass(bool, int)
True
>>> True * 5
5
So to answer your question, no work necessary - you already have what you are looking for.
* Note I use is as an English word, not the Python keyword is - True will not be the same object as any random 1.
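Since bool is a subclass of int, arithmetic on a boolean Series already treats True/False as 1/0; a small sketch with made-up data:
import pandas as pd

s = pd.Series([True, False, True])
print(s.sum())           # 2: each True counts as 1
print((s * 5).tolist())  # [5, 0, 5]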
This question specifically mentions a single column, so the currently accepted answer works. However, it doesn't generalize to multiple columns. For those interested in a general solution, use the following:
df.replace({False: 0, True: 1}, inplace=True)
This works for a DataFrame that contains columns of many different types, regardless of how many are boolean.
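A minimal sketch on a mixed-dtype frame (column names and values are made up):
import pandas as pd

df = pd.DataFrame({"flag": [True, False], "name": ["a", "b"], "n": [1, 2]})
df.replace({False: 0, True: 1}, inplace=True)

print(df["flag"].tolist())  # [1, 0]
print(df["name"].tolist())  # ['a', 'b'] -- non-boolean columns keep their values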
You can also do this directly on DataFrames:
In [104]: df = DataFrame(dict(A=True, B=False), index=range(3))
In [105]: df
Out[105]:
A B
0 True False
1 True False
2 True False
In [106]: df.dtypes
Out[106]:
A bool
B bool
dtype: object
In [107]: df.astype(int)
Out[107]:
A B
0 1 0
1 1 0
2 1 0
In [108]: df.astype(int).dtypes
Out[108]:
A int64
B int64
dtype: object
Use Series.view to convert booleans to integers:
df["somecolumn"] = df["somecolumn"].view('i1')
You can use a transformation on your DataFrame:
df = pd.DataFrame(my_data)   # my_data holds your boolean values/condition
# transforming True/False into 1/0
df = df * 1
I had to map FAKE/REAL to 0/1 but couldn't find a proper answer.
Below is how to map the column 'type', which has the values FAKE/REAL, to 0/1 (note: the same approach can be applied to any column name and values):
df.loc[df['type'] == 'FAKE', 'type'] = 0
df.loc[df['type'] == 'REAL', 'type'] = 1
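A runnable sketch of the same mapping (the 'type' column and its values are illustrative):
import pandas as pd

df = pd.DataFrame({"type": ["FAKE", "REAL", "FAKE"]})
df.loc[df['type'] == 'FAKE', 'type'] = 0
df.loc[df['type'] == 'REAL', 'type'] = 1
print(df['type'].tolist())  # [0, 1, 0]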
This is a reproducible example based on some of the existing answers:
import pandas as pd
def bool_to_int(s: pd.Series) -> pd.Series:
    """Convert the boolean to binary representation, maintain NaN values."""
    return s.replace({True: 1, False: 0})

# generate a random dataframe
df = pd.DataFrame({"a": range(10), "b": range(10, 0, -1)}).assign(
    a_bool=lambda df: df["a"] > 5,
    b_bool=lambda df: df["b"] % 2 == 0,
)

# select all bool columns (or specify which cols to use)
bool_cols = [c for c, d in df.dtypes.items() if d == "bool"]

# apply the new coding to a new dataframe (or can replace the existing one);
# bind c as a default argument so each lambda keeps its own column name
df_new = df.assign(**{c: lambda df, c=c: df[c].pipe(bool_to_int) for c in bool_cols})
Tried and tested (this assumes the column holds the strings 'True'/'False'):
df[col] = df[col].map({'True': 1, 'False': 0})
If there is more than one column with True/False values, use the following:
for col in bool_cols:
    df[col] = df[col].map({'True': 1, 'False': 0})
@AMC wrote this in a comment.
If the column is of type object:
df["somecolumn"] = df["somecolumn"].astype(bool).astype(int)
I have a dataframe column that was read in as strings containing dates in the format "YYYY-MM-DD". I converted the column to datetime using pd.to_datetime (with coerce), and I intend to search the column for NaTs using numpy.isnat().
defaultDate = datetime.datetime(2020, 12, 31)
df['dates'] = pd.to_datetime(df['dates'], errors = 'coerce')
df['newDates'] = [x if ~np.isnat(x) else defaultDate for x in df['dates']]
When I try to run the code, I get the error:
TypeError: ufunc 'isnat' is only defined for datetime and timedelta.
I later found out that the dtype of the column had been converted to <M8[ns]. Is there a way to properly convert to datetime, or is there some way to get around this? I have numpy version 1.16.4.
<M8[ns] is a synonym for datetime64[ns]. Also, you don't need np.isnat if you are dealing with pandas datetime:
defaultDate = pd.to_datetime('2020-12-31')
df['newDates'] = [x if ~np.isnat(x) else defaultDate for x in df['dates']]
df['newDates'] = df['dates'].fillna(defaultDate)
Looks like isnat is meant to test an array like:
In [47]: np.array([0,1,'NaT'], 'datetime64[D]')
Out[47]: array(['1970-01-01', '1970-01-02', 'NaT'], dtype='datetime64[D]')
In [48]: np.isnat(_)
Out[48]: array([False, False, True])
I had to experiment to find out how to generate a NaT element. There may be other ways.
Can you give a dataframe or Series that has sample values, both valid dates and non-dates? That will make it easier to explore ways of filtering. I believe pandas has some sort of not-a-time element, but I don't know if it's compatible with the numpy one. Keep in mind, also, that pandas readily switches to object dtype when Series elements include strings and None.
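For what it's worth, a quick check with made-up data suggests pandas' NaT is compatible with np.isnat once the Series is converted to a numpy datetime64 array:
import numpy as np
import pandas as pd

ds = pd.Series([pd.Timestamp("1970-01-01"), pd.NaT])  # dtype: datetime64[ns]
print(np.isnat(ds.to_numpy()))                        # [False  True]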
Testing a Series:
In [50]: ds = pd.Series(_47)
In [51]: ds
Out[51]:
0 1970-01-01
1 1970-01-02
2 NaT
dtype: datetime64[ns]
In [52]: ds.isna()
Out[52]:
0 False
1 False
2 True
dtype: bool
In [54]: ds.isnull()
Out[54]:
0 False
1 False
2 True
dtype: bool
Change an element of the Series:
In [58]: ds[2]=12
In [59]: ds
Out[59]:
0 1970-01-01 00:00:00
1 1970-01-02 00:00:00
2 12
dtype: object
That changes the dtype:
In [60]: ds.values
Out[60]:
array([Timestamp('1970-01-01 00:00:00'), Timestamp('1970-01-02 00:00:00'),
12], dtype=object)
In [61]: np.isnat(_)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-61-47ce91c66a51> in <module>
----> 1 np.isnat(_)
TypeError: ufunc 'isnat' is only defined for datetime and timedelta.
A possible conversion sequence:
A series with a mix of dates and something else, object dtype:
In [118]: ds
Out[118]:
0 1970-01-01 00:00:00
1 1970-01-02 00:00:00
2 12
dtype: object
In [119]: ds1=pd.to_datetime(ds,errors='coerce')
In [120]: ds1
Out[120]:
0 1970-01-01
1 1970-01-02
2 NaT
dtype: datetime64[ns]
Conversion with coerce produces a NaT:
In [121]: idx = np.isnat(ds1)
In [122]: idx
Out[122]:
0 False
1 False
2 True
dtype: bool
In [123]: ds1[idx]
Out[123]:
2 NaT
dtype: datetime64[ns]
Define the correct default; its dtype is important, since pandas changes the dtype readily (numpy does not):
In [124]: default= np.array('2020-12-31','datetime64[ns]')[()]
In [125]: default
Out[125]: numpy.datetime64('2020-12-31T00:00:00.000000000')
In [126]: ds1[idx]=default
In [127]: ds1
Out[127]:
0 1970-01-01
1 1970-01-02
2 2020-12-31
dtype: datetime64[ns]
How can I identify which column(s) in my DataFrame contain a specific string 'foo'?
Sample DataFrame:
>>> import pandas as pd
>>> df = pd.DataFrame({'A':[10,20,42], 'B':['foo','bar','blah'],'C':[3,4,5], 'D':['some','foo','thing']})
I want to find B and D here.
I can search for numbers:
If I'm looking for a number (e.g. 42) instead of a string, I can generate a boolean mask like this:
>>> ~(df.where(df==42)).isnull().all()
A True
B False
C False
D False
dtype: bool
but not strings:
>>> ~(df.where(df=='foo')).isnull().all()
TypeError: Could not compare ['foo'] with block values
I don't want to iterate over each column and row if possible (my actual data is much larger than this example). It feels like there should be a simple and efficient way.
How can I do this?
One way with underlying array data -
df.columns[(df.values=='foo').any(0)].tolist()
Sample run -
In [209]: df
Out[209]:
A B C D
0 10 foo 3 some
1 20 bar 4 foo
2 42 blah 5 thing
In [210]: df.columns[(df.values=='foo').any(0)].tolist()
Out[210]: ['B', 'D']
If you are looking for just the column-mask -
In [205]: (df.values=='foo').any(0)
Out[205]: array([False, True, False, True], dtype=bool)
Option 1 df.values
~(df.where(df.values=='foo')).isnull().all()
Out[91]:
A False
B True
C False
D True
dtype: bool
Option 2 isin
~(df.where(df.isin(['foo']))).isnull().all()
Out[94]:
A False
B True
C False
D True
dtype: bool
Unfortunately, df.where won't compare a str through the syntax you gave; it has to be run on a series of string type to compare it with a string, unless I am missing something.
Try this:
~df101.where(df101.isin(['foo'])).isnull().all()
A False
B True
C False
D True
dtype: bool
When I display a cell from a dataframe, I get
df[df.a==1]['b']
Out[120]:
0 2
Name: b, dtype: int64
However, when I want to convert it to string, I get
str(df[df.a==1]['b'])
Out[124]: '0 2\nName: b, dtype: int64'
How do I just print the value without dtype and the name without string manipulations?
Just do the following; what is returned is a pandas Series, so you need to access either the values or the name attribute:
In [2]:
df = pd.DataFrame({'a':np.arange(5), 'b':np.random.randn(5)})
df
Out[2]:
a b
0 0 -1.795051
1 1 1.579010
2 2 1.908378
3 3 1.814691
4 4 -0.470358
In [16]:
type(df[df['a']==2]['b'])
Out[16]:
pandas.core.series.Series
In [9]:
df[df['a']==2]['b'].values[0]
Out[9]:
1.9083782154318822
In [15]:
df.loc[df['a']==2,'b'].name
Out[15]:
'b'
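As a variation on the same idea, a small sketch that pulls the scalar out with .iloc and then stringifies it (assuming the same df as above):
value = df.loc[df['a'] == 2, 'b'].iloc[0]  # plain scalar, no index/name metadata
print(str(value))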
My application needs to compare Series instances that sometimes contain nans. That causes ordinary comparison using == to fail, since nan != nan:
import numpy as np
from pandas import Series
s1 = Series([1,np.nan])
s2 = Series([1,np.nan])
>>> (Series([1, nan]) == Series([1, nan])).all()
False
What's the proper way to compare such Series?
How about this. First check the NaNs are in the same place (using isnull):
In [11]: s1.isnull()
Out[11]:
0 False
1 True
dtype: bool
In [12]: s1.isnull() == s2.isnull()
Out[12]:
0 True
1 True
dtype: bool
Then check the values which aren't NaN are equal (using notnull):
In [13]: s1[s1.notnull()]
Out[13]:
0 1
dtype: float64
In [14]: s1[s1.notnull()] == s2[s2.notnull()]
Out[14]:
0 True
dtype: bool
In order to be equal we need both to be True:
In [15]: (s1.isnull() == s2.isnull()).all() and (s1[s1.notnull()] == s2[s2.notnull()]).all()
Out[15]: True
You could also check name etc. if this wasn't sufficient.
If you want to raise if they are different, use assert_series_equal from pandas.util.testing:
In [21]: from pandas.util.testing import assert_series_equal
In [22]: assert_series_equal(s1, s2)
Currently one should just use series1.equals(series2) (see the docs). This also checks that NaNs are in the same positions.
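A quick sketch of equals with NaNs in matching positions, reusing the s1/s2 from the question:
import numpy as np
import pandas as pd

s1 = pd.Series([1, np.nan])
s2 = pd.Series([1, np.nan])
print(s1.equals(s2))  # True: values and NaN positions match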
I came looking here for a similar answer, and I think @Sam's answer is the neatest if you just want one value back. But I wanted a truth array back from an element-wise (but null-safe) comparison.
So finally I ended up with:
import pandas as pd
s1 = pd.Series([1,np.nan, 2, np.nan])
s2 = pd.Series([1,np.nan, np.nan, 2])
(s1 == s2) | ~(s1.isnull() ^ s2.isnull())
The result:
0 True
1 True
2 False
3 False
dtype: bool
Comparing this to s1 == s2:
0 True
1 False
2 False
3 False
dtype: bool
In [16]: s1 = Series([1,np.nan])
In [17]: s2 = Series([1,np.nan])
In [18]: (s1.dropna()==s2.dropna()).all()
Out[18]: True