Using isin() with - python

I am working with a pandas Series and I am trying to use the isin() method to find some of the members of the series. However, for pandas timestamp objects, this function does not appear to be working correctly.
import pandas
data = pandas.date_range('jan-01-2013','jan-05-2013')
s = pandas.Series(data)
print s.iloc[0] == data[0] # Returns True (correct)
print s.isin(data[0:2]) # Returns a series of all false values (incorrect)
Obviously, for the second print statement, the expected result is that the first two members of the series are true and everything else is false. However, it returns all false values. Is this a bug or am I implementing isin() incorrectly?

It works if you pass the underlying NumPy array instead:
s.isin(data[0:2].values)
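As a quick check, the sketch below runs both forms on the same data. It assumes a recent pandas release, where I would expect passing the DatetimeIndex slice directly to give the same result as passing .values; the all-False result in the question looks like version-specific behavior. The ISO date strings are my substitution for the ones in the question.

```python
import pandas as pd

data = pd.date_range("2013-01-01", "2013-01-05")
s = pd.Series(data)

# Passing the underlying NumPy datetime64 array, as in the answer above
via_values = s.isin(data[0:2].values)

# Passing the DatetimeIndex slice directly, as in the question
via_index = s.isin(data[0:2])

print(via_values.tolist())
print(via_index.tolist())
```

If the two printed lists differ on your installation, the .values form is the safer one to use.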

Related

Remove entries of pandas multiindex if function returns false

I have a function that receives a whole entry of a MultiIndex and returns True or False for that entire entry.
I feed it several columns of the entry as key-value pairs, e.g.:
temp = cells.loc[0]
x = temp.set_index(['eta','phi'])['e'].to_dict()
filter_frame(x,20000) # drop event if this function returns false
So far I have only found examples where people want to remove single rows, but I am talking about an entire entry with several hundred subentries, since all subentries are used to compute the boolean.
How can I drop entries that don't fulfill this condition?
Edit:
Data sample
The filter_frame() function would just produce a true or false for this entry 0, which contains 780 rows.
The function itself works fine; I just don't know how to apply it without slow for loops.
What I am looking for is something like this
cells = cells[apply the filter function somehow for all entries]
and have a significantly smaller dataframe
Edit2:
print(mask) of jezraels solution:
First call the function once per first-level value of the MultiIndex with GroupBy.apply, which gives one boolean per group. To filter the original DataFrame, drop the second index level with MultiIndex.droplevel and map the remaining level onto those per-group booleans with Index.map; the result can then be used for boolean indexing:
def f(temp):
    # build the {(eta, phi): e} dict for one entry and evaluate it
    x = temp.set_index(['eta','phi'])['e'].to_dict()
    return filter_frame(x,20000)

mask = cells.index.droplevel(1).map(cells.groupby(level=0).apply(f))
out = cells[mask]
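The approach above can be tried on a small made-up frame. In this sketch, filter_frame is a hypothetical stand-in (it keeps an entry when the summed energy exceeds the threshold), and the toy data and index names are assumptions, not the asker's data. It also shows GroupBy.filter, which I believe gives the same rows in one call.

```python
import pandas as pd

def filter_frame(x, threshold):
    # Hypothetical stand-in for the asker's function:
    # keep the entry if the total energy exceeds the threshold.
    return sum(x.values()) > threshold

idx = pd.MultiIndex.from_product([[0, 1, 2], [0, 1]], names=["entry", "subentry"])
cells = pd.DataFrame(
    {
        "eta": [0, 1, 0, 1, 0, 1],
        "phi": [0, 0, 0, 0, 0, 0],
        "e": [15000, 9000, 100, 200, 30000, 5],
    },
    index=idx,
)

def f(temp):
    x = temp.set_index(["eta", "phi"])["e"].to_dict()
    return filter_frame(x, 20000)

# One boolean per first-level entry
keep = cells.groupby(level=0).apply(f)

# Broadcast the per-entry booleans back onto every row
mask = cells.index.droplevel(1).map(keep).to_numpy(dtype=bool)
out = cells[mask]

# Alternative: let GroupBy.filter drop the failing groups directly
out_via_filter = cells.groupby(level=0).filter(f)

print(out)
```

Either way the function is still called once per entry, so the speed-up over an explicit loop comes mainly from avoiding row-by-row bookkeeping.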

If Condition Based On 2 Columns

I'm trying to conditionally execute a query only when ColumnA = 'New' and ColumnB = 'Left' (in each individual row). I know that str.contains() works when I have only one condition; however, with two I'm getting a ValueError ("ValueError: The truth value of a Series is ambiguous..."). Can this approach not be applied to my scenario? Alternatively, is there a better approach?
Current code:
if df1['ColumnA'].str.contains('New') and df1['ColumnB'].str.contains('Left'):
    do something...
Very basic example of the dataframe:
ColumnA   ColumnB
New       Left
Used      Right
Scrap     Down
New       Right
First row would be the desired row to carry forward (since it meets the criteria).
You have the right idea, but your code doesn't express exactly what you want to do.
df1['ColumnA'].str.contains('New')
will return a Series of True and False values, one per row, marking where the condition holds; it is not a single True or False for whether the column contains 'New' at all. To get that single value, consider something like the following:
'New' in df1['ColumnA'].values
This returns a single boolean, as you expected. If you want to test row by row instead, you must combine the Series element-wise with the bitwise operator (&).
Hopefully this helps (:
Use bitwise & on two mask arrays and generate another column.
>>> import pandas as pd
>>> df = pd.DataFrame({'A':['New','Used','Scrap','New'], 'B':['Left','Right','Down','Right']})
>>> df
       A      B
0    New   Left
1   Used  Right
2  Scrap   Down
3    New  Right
>>> df['C'] = df['A'].str.contains('New') & df['B'].str.contains('Left')
>>> df
       A      B      C
0    New   Left   True
1   Used  Right  False
2  Scrap   Down  False
3    New  Right  False
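To connect this back to the original if statement, here is a minimal sketch using the question's column names. It uses exact equality rather than str.contains, on the assumption that whole-cell matches are what is wanted; the variable names mask and matching are mine.

```python
import pandas as pd

df1 = pd.DataFrame({
    "ColumnA": ["New", "Used", "Scrap", "New"],
    "ColumnB": ["Left", "Right", "Down", "Right"],
})

# Element-wise AND of two boolean Series; the parentheses are required
# because & binds more tightly than ==.
mask = (df1["ColumnA"] == "New") & (df1["ColumnB"] == "Left")

# Rows that satisfy both conditions
matching = df1[mask]

# A Series has no single truth value, so reduce it explicitly before `if`
if mask.any():
    print(matching)
```

Using mask.any() (or mask.all()) is what removes the "truth value of a Series is ambiguous" error, because it turns the Series into one boolean.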

Pandas dataframe reports no matching string when the string is present

Fairly new to python. This seems to be a really simple question but I can't find any information about it.
I have a list of strings, and for each string I want to check whether it is present in a dataframe (actually in a particular column of the dataframe). Not whether a substring is present, but the whole exact string.
So my dataframe is something like the following:
A=pd.DataFrame(["ancestry","time","history"])
I should simply be able to use the "string in dataframe" method, as in
"time" in A
This returns False however.
If I run
"time" == A.iloc[1]
it returns "True", but annoyingly as part of a series, and this depends on knowing where in the dataframe the corresponding string is.
Is there some way I can just use the string in df method, to easily find out whether the strings in my list are in the dataframe?
Add .to_numpy() to the end:
'time' in A.to_numpy()
As you've noticed, the x in pandas.DataFrame syntax doesn't produce the result you want. But .to_numpy() transforms the dataframe into a Numpy array, and x in numpy.array works as you expect.
The way to deal with this is to compare the whole dataframe with "time". That will return a mask where each value of the DF is True if it was time, False otherwise. Then, you can use .any() to check if there are any True values:
>>> A = pd.DataFrame(["ancestry","time","history"])
>>> A
          0
0  ancestry
1      time
2   history
>>> A == "time"  # or A.eq("time")
       0
0  False
1   True
2  False
>>> (A == "time").any()
0    True
dtype: bool
Notice in the above output that (A == "time").any() returns a Series with one entry per column, indicating whether that column contains "time". If you want to check the entire dataframe (across all columns), call .any() twice:
>>> (A == "time").any().any()
True
I believe (myseries==mystr).any() will do what you ask. The special __contains__ method of DataFrames (which informs behavior of in) checks whether your string is a column of the DataFrame, e.g.
>>> A = pd.DataFrame({"c": [0,1,2], "d": [3,4,5]})
>>> 'c' in A
True
>>> 0 in A
False
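I would slightly modify your dataframe and use .str.contains to check where the string is present in your series. Note that .str.contains matches substrings, so it will also flag longer strings that merely contain "time".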
df=pd.DataFrame()
df['A']=pd.Series(["ancestry","time","history"])
df['A'].str.contains("time")
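Since the question asks about a whole list of strings and exact matches, isin may be a closer fit than the approaches above. A minimal sketch, where the list wanted is a made-up example:

```python
import pandas as pd

A = pd.DataFrame(["ancestry", "time", "history"])
wanted = ["time", "space"]  # hypothetical list of strings to look up

# For each wanted string: does it appear as a whole value in column 0?
found = pd.Series(wanted, index=wanted).isin(A[0])
print(found.to_dict())

# Does any wanted string appear anywhere in the dataframe?
anywhere = bool(A.isin(wanted).any().any())
print(anywhere)
```

isin compares whole values, so "tim" or "timeless" would not count as a match for "time".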

I am trying to assign a Holiday classifier to a list of dates

I have two dataframes, one with a list of dates and their corresponding holiday (df2), and another one with a list of transactions (df1). I'm trying to use the first one to flag holidays on the second one, but whenever I try to create a function and apply it, it just returns empty values for everything.
The function I'm using is as follows:
def isHoliday(t, holiday_list):
    f = t.strftime('%Y-%m-%d')
    if(f in (holiday_list)):
        return 1
    else:
        return 0
And when I try to apply it:
df1.insert(3, 'isHoliday', df1['DATE'].apply(lambda x: isHoliday(x, df2['DATE'])))
The dataframe only returns 0's. I've looked up date to date comparison and the answer I got from it was to compare them as strings, hence why the function is structured in that way.
What am I doing wrong? I've already preformatted the df2['DATE'] column as a string with the same strftime()
The only direct alternative I can think of is using df.lookup from one df to the other, but I'm not sure how to do it.
For the if statement to do what you're expecting, you need to get a list or a NumPy array from the Series returned by df2['DATE']. You can do that either with the .values property or by converting the Series to a list with list(df2['DATE']):
import pandas as pd
df2 = pd.DataFrame(data=[['2014-01-02'], ['2014-01-03']], columns=['DATE'])
print('2014-01-02' in df2['DATE']) # False: `in` on a Series checks the index labels
print('2014-01-02' in df2['DATE'].values) # True
print('2014-01-02' in list(df2['DATE'])) # True
Alternatively, the .str.contains() method can compare all the strings and then any() will find if there was a match.
any(df2['DATE'].str.contains('2014-01-02', regex=False)) # true
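The reason the plain in test fails is that membership on a Series is checked against its index labels, not its values. A small sketch with the same two dates (the variable names are mine):

```python
import pandas as pd

dates = pd.Series(["2014-01-02", "2014-01-03"])

# `in` on a Series looks at the index (here 0 and 1), not the values
label_hit = 0 in dates
value_miss = "2014-01-02" in dates

# Checking against the values explicitly
value_hit = "2014-01-02" in dates.values

print(label_hit, value_miss, value_hit)
```

This mirrors how in behaves on a dict, where it tests the keys.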
Converting your series to a list should solve your problem:
def isHoliday(t, holiday_list):
    f = t.strftime('%Y-%m-%d')
    if f in list(holiday_list): # convert series to list
        return 1
    else:
        return 0
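If the per-row apply turns out to be slow, the same flag can probably be computed in one vectorized step with isin. This is a sketch on made-up data; it assumes df1['DATE'] is a datetime column and df2['DATE'] holds strings in the same '%Y-%m-%d' format, as described in the question. The AMOUNT and HOLIDAY columns are placeholders.

```python
import pandas as pd

df1 = pd.DataFrame({
    "DATE": pd.to_datetime(["2014-01-01", "2014-01-02", "2014-01-03"]),
    "AMOUNT": [10.0, 20.0, 30.0],  # placeholder transaction data
})
df2 = pd.DataFrame({
    "DATE": ["2014-01-02", "2014-01-03"],
    "HOLIDAY": ["Example Day A", "Example Day B"],  # placeholder names
})

# Format the transaction dates as strings, then test membership
# against the holiday strings for the whole column at once.
as_text = df1["DATE"].dt.strftime("%Y-%m-%d")
df1["isHoliday"] = as_text.isin(df2["DATE"]).astype(int)

print(df1)
```

isin compares against the values of df2['DATE'], so it avoids the index-versus-values pitfall of the in operator.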

Find if a column in dataframe has neither nan nor none

I have gone through all the posts on the website and am not able to find a solution to my problem.
I have a dataframe with 15 columns. Some of them come with None or NaN values. I need help in writing the if-else condition.
If the value in the column is neither None nor NaN, I need to format the datetime column. The current code is below:
for index, row in df_with_job_name.iterrows():
    start_time=df_with_job_name.loc[index,'startTime']
    if not df_with_job_name.isna(df_with_job_name.loc[index,'startTime']) :
        start_time_formatted = datetime(*map(int, re.split('[^\d]', start_time)[:-1]))
The error that I am getting is
if not df_with_job_name.isna(df_with_job_name.loc[index,'startTime']) :
TypeError: isna() takes exactly 1 argument (2 given)
A direct way to take care of missing/invalid values is probably:
def is_valid(val):
    if val is None:
        return False
    try:
        return not math.isnan(val)
    except TypeError:
        return True
and of course you'll have to import math.
Also, it seems isna is called without any argument and returns a dataframe of boolean values. You can iterate through both dataframes to determine whether each value is valid.
isna takes your entire data frame as the instance argument (that's self, if you're already familiar with classes) and returns a data frame of Boolean values, True where that value is invalid. You tried to specify the individual value you're checking as a second input argument. isna doesn't work that way; it takes empty parentheses in the call.
You have a couple of options. One is to follow the individual checking tactics here. The other is to make the map of the entire data frame and use that:
null_map_df = df_with_job_name.isna()
for index, row in df_with_job_name.iterrows():
    # look up the flag for this row's 'startTime' cell
    if not null_map_df.loc[index, 'startTime']:
        start_time = df_with_job_name.loc[index, 'startTime']
        start_time_formatted = datetime(*map(int, re.split(r'[^\d]', start_time)[:-1]))
The null map is indexed the same way as the original frame, so the lookup uses the row label and the 'startTime' column. Also, you should be able to apply an any operation to the entire row at once.
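Two further options may be worth trying. The top-level function pd.isna does accept a single value, which is the call shape the question was reaching for, and pd.to_datetime with errors="coerce" can parse the whole column while leaving missing entries as NaT. The sketch below assumes startTime holds ISO-like strings such as "2021-03-04T05:06:07Z"; that format is my guess from the regex in the question, and the sample frame is made up.

```python
import numpy as np
import pandas as pd

df_with_job_name = pd.DataFrame(
    {"startTime": ["2021-03-04T05:06:07Z", None, np.nan]}  # made-up sample
)

# pd.isna (the module-level function) takes the value to test as its argument
first_missing = pd.isna(df_with_job_name.loc[0, "startTime"])
second_missing = pd.isna(df_with_job_name.loc[1, "startTime"])

# Vectorized alternative to the loop: None/NaN and unparseable
# strings become NaT instead of raising.
parsed = pd.to_datetime(df_with_job_name["startTime"], errors="coerce", utc=True)

print(first_missing, second_missing)
print(parsed)
```

After this, parsed.notna() gives the mask of rows that had a usable start time.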
