If Condition Based On 2 Columns - python

Trying to conditionally execute a query only when ColumnA = 'New' and ColumnB = 'Left' (in each individual row). I know that str.contains() works when I only have one condition; however, with two conditions I'm getting a ValueError ("ValueError: The truth value of a Series is ambiguous..."). Can this approach not be successfully applied to my given scenario? Alternatively, is there a better approach?
Current code:
if df1['ColumnA'].str.contains('New') and df1['ColumnB'].str.contains('Left'):
    do something...
Very basic example of the dataframe:
ColumnA   ColumnB
New       Left
Used      Right
Scrap     Down
New       Right
First row would be the desired row to carry forward (since it meets the criteria).

You have the right idea; however, your code isn't expressing exactly what you want to do.
df1['ColumnA'].str.contains('New')
returns a Series of True/False values indicating, row by row, where the condition holds, not a single True/False for whether the entire column contains 'New'. To check the whole column, consider something like the following:
'New' in df1['ColumnA'].values
If you are trying to do it on a row-by-row basis, then you must combine the truth values across Series with the bitwise AND operator (&).
The in check above returns a single boolean like you expected; hopefully this helps (:
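For completeness, a minimal sketch (reusing the df1 name from the question) of the row-by-row version, collapsed to the single boolean that if expects:
mask = df1['ColumnA'].str.contains('New') & df1['ColumnB'].str.contains('Left')
# mask is a boolean Series; .any() collapses it into a single True/False
if mask.any():
    matching_rows = df1[mask]
    # do something with matching_rows...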

Use bitwise & on two mask arrays and generate another column.
>>> import pandas as pd
>>> df = pd.DataFrame({'A':['New','Used','Scrap','New'], 'B':['Left','Right','Down','Right']})
>>> df
A B
0 New Left
1 Used Right
2 Scrap Down
3 New Right
>>> df['C'] = df['A'].str.contains('New') & df['B'].str.contains('Left')
>>> df
A B C
0 New Left True
1 Used Right False
2 Scrap Down False
3 New Right False

Related

Pandas dataframe reports no matching string when the string is present

Fairly new to python. This seems to be a really simple question but I can't find any information about it.
I have a list of strings, and for each string I want to check whether it is present in a dataframe (actually, in a particular column of the dataframe), not whether a substring is present, but the whole exact string.
So my dataframe is something like the following:
A=pd.DataFrame(["ancestry","time","history"])
I should simply be able to use the "string in dataframe" method, as in
"time" in A
This returns False however.
If I run
"time" == A.iloc[1]
it returns "True", but annoyingly as part of a series, and this depends on knowing where in the dataframe the corresponding string is.
Is there some way I can just use the string in df method, to easily find out whether the strings in my list are in the dataframe?
Add .to_numpy() to the end:
'time' in A.to_numpy()
As you've noticed, the x in pandas.DataFrame syntax doesn't produce the result you want. But .to_numpy() transforms the dataframe into a Numpy array, and x in numpy.array works as you expect.
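For instance, a small sketch of checking a whole list of strings this way (the words list here is just an assumed example):
import pandas as pd

A = pd.DataFrame(["ancestry", "time", "history"])
words = ["time", "space"]
found = {w: w in A.to_numpy() for w in words}
# found == {'time': True, 'space': False}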
The way to deal with this is to compare the whole dataframe with "time". That will return a mask where each value of the DF is True if it was time, False otherwise. Then, you can use .any() to check if there are any True values:
>>> A = pd.DataFrame(["ancestry","time","history"])
>>> A
0
0 ancestry
1 time
2 history
>>> A == "time" # or A.eq("time")
0
0 False
1 True
2 False
>>> (A == "time").any()
0 True
dtype: bool
Notice in the above output that (A == "time").any() returns a Series with one entry per column, indicating whether that column contained "time". If you want to check the entire dataframe (across all columns), call .any() twice:
>>> (A == "time").any().any()
True
I believe (myseries==mystr).any() will do what you ask. The special __contains__ method of DataFrames (which informs behavior of in) checks whether your string is a column of the DataFrame, e.g.
>>> A = pd.DataFrame({"c": [0,1,2], "d": [3,4,5]})
>>> 'c' in A
True
>>> 0 in A
False
I would slightly modify your dataframe and use .str.contains for checking where the string is present in your series.
df=pd.DataFrame()
df['A']=pd.Series(["ancestry","time","history"])
df['A'].str.contains("time")
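One caveat worth noting, since the question asks for whole-string matches rather than substrings: an exact equality comparison may be safer here, for example:
df['A'] == "time"          # element-wise exact match, returns a boolean Series
(df['A'] == "time").any()  # single True/False for the whole column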

Create a new variable based on 4 other variables

I have a dataframe in Python called df1 where I have 4 dichotomous variables called Ordering_1, Ordering_2, Ordering_3, and Ordering_4 with True/False values.
I need to create a variable called Clean, which is based on the 4 other variables. Meaning, when Ordering_1 == True, then Clean == Ordering_1; when Ordering_2 == True, then Clean == Ordering_2, and so on. Clean would then be a combination of all the True values from Ordering_1, Ordering_2, Ordering_3, and Ordering_4.
Here is an example of how I would like the variable Clean to be:
I have tried the below code but it does not work:
df1[Clean] = df1[Ordering_1] + df1[Ordering_1] + df1[Ordering_1] + df1[Ordering_1]
Would anyone please be able to help me do this in Python?
Universal solution if there are multiple Trues per row: filter the Ordering_ columns with DataFrame.filter and then use DataFrame.dot for matrix multiplication:
df1 = df.filter(like='Ordering_')
df['Clean'] = df1.dot(df1.columns + ',').str.strip(',')
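A minimal, self-contained sketch of this filter + dot approach (the example data here is made up to illustrate multiple Trues per row):
import pandas as pd

df = pd.DataFrame({'Ordering_1': [True, False, False],
                   'Ordering_2': [False, True, True],
                   'Ordering_3': [False, False, True],
                   'Ordering_4': [False, False, False]})

ordering = df.filter(like='Ordering_')
# the dot product of a boolean frame with the column names concatenates the names of the True columns per row
df['Clean'] = ordering.dot(ordering.columns + ',').str.strip(',')
# df['Clean'] -> ['Ordering_1', 'Ordering_2', 'Ordering_2,Ordering_3']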
If there is only one True value per row, you can use the booleans of each column ("Ordering_1", "Ordering_2", etc.) together with df1.loc.
Note that this is what you get with df1.Ordering_1:
0 True
1 False
2 False
3 False
Name: Ordering_1, dtype: bool
With df1.loc you can filter on the True rows, in this case only row 0.
So you can code this:
Create a new blank "clean" column:
df1["clean"]=""
Set the rows where the series df1.Ordering_1 is True to "Ordering_1":
df1.loc[df1.Ordering_1,["clean"]] = "Ordering_1"
Proceed with the remaining columns in the same way.
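Put together, a short sketch of this approach (assuming, as above, only one True per row, since later columns would overwrite earlier ones):
df1["clean"] = ""
for col in ["Ordering_1", "Ordering_2", "Ordering_3", "Ordering_4"]:
    # wherever this column is True, write its name into "clean"
    df1.loc[df1[col], "clean"] = col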

Pandas Dataframes: is the value of a column in a list nested in another column, same row?

I am working with a Pandas dataframe similar to this:
My_date Something_else My_list
0 25/10/2019 ... [25/10/2019, 26/10/2019]
1 03/07/2019 ... [28/11/2017, 12/12/2017, 26/12/2017]
2 09/04/2019 ... [11/06/2015]
I would like to check if the value in the column called "My_date" is in the list on the same row, column "My_list". For instance, here I would like to get the following output, in a vectorized way or at least very efficiently:
Result
0 true
1 false
2 false
I could do this using a 'for' loop; various methods are described here, for instance. However, I am aware that iterating is rarely the best solution, especially since my table has more than 1 million rows and many of the lists have 365 values. (But as shown above, these lists are not always date ranges.)
I know that there are many ways to do vectorized calculations on DataFrames, using .loc or .eval for instance. The point is that in my case, nothing works as expected due to these nested lists... Thus, I would like to find a vectorized solution to do that. If it matters, all my "dates" are of type pandas.Timestamp.
There are probably other questions related to similar issues, however I haven't found any appropriate answer or question using my own words.
Try:
df['Result'] = df.apply(lambda x: x['My_date'] in x['My_list'], axis=1)
df=pd.DataFrame({'My_date' : ['25/10/2019','03/07/2019','09/04/2019'], 'My_list' : [['25/10/2019', '26/10/2019'],['28/11/2017', '12/12/2017', '26/12/2017'],['11/06/2015']]})
df['Result'] = df.apply(lambda x: x['My_date'] in x['My_list'], axis=1)
Outputs:
My_date My_list Result
0 25/10/2019 [25/10/2019, 26/10/2019] True
1 03/07/2019 [28/11/2017, 12/12/2017, 26/12/2017] False
2 09/04/2019 [11/06/2015] False
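As an aside, a plain list comprehension over the two columns is often faster than apply for this kind of row-wise membership test; a rough sketch:
# same result as the apply version, built row by row from the two columns
df['Result'] = [d in lst for d, lst in zip(df['My_date'], df['My_list'])]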

How to conditionally slice a dataframe in pandas

Consider a pandas DataFrame constructed like:
df = pandas.DataFrame({'a':['one','two','three']})
then I can locate the specific row of the dataframe containing two like:
df[df.a == 'two']
but so far the only way I have found to subset the DataFrame up to this row is like:
df[:df[df.a == 'two'].index[0]]
but that is quite ugly, so:
Is there a more appropriate way to accomplish this subsetting?
Specifically, I am interested in how to slice the DataFrame between row indices where a given column matches some arbitrary text string (in this case 'two'). For this particular case it would be equivalent to df[:2]. In general however, the ability to locate an index for the start and/or end of a slice based on column values seems like a reasonable thing?
one last example, maybe will help; I would expect to be able to do something like this:
df[df.a == 'one' : df.a == 'three']
to get a slice containing rows 1 & 2 of the DataFrame, equivalent to df[0:3]
You want to identify the indices of particular start and stop values and get the matching rows plus all the rows in between. One way is to find the indexes and build a range, but you already said that you don't like that approach. Here is a general solution using boolean logic that should work for you.
First, let's make a more interesting example:
import pandas as pd
df = pd.DataFrame({'a':['one','two','three', 'four', 'five']})
Suppose start = "two" and stop = "four". That is, you want to get the following output DataFrame:
a
1 two
2 three
3 four
We can find the index of the bounding rows via:
df["a"].isin({start, stop})
#0 False
#1 True
#2 False
#3 True
#4 False
#Name: a, dtype: bool
If the value for index 2 were True, we would be done as we could just use this output as a mask. So let's find a way to create the mask we need.
First we can use cummax() and the boolean XOR operator (^) to achieve:
(df["a"]==start).cummax() ^ (df["a"]==stop).cummax()
#0 False
#1 True
#2 True
#3 False
#4 False
#Name: a, dtype: bool
This is almost what we want, except we are missing the stop value index. So let's just bitwise OR (|) the stop condition:
#0 False
#1 True
#2 True
#3 True
#4 False
#Name: a, dtype: bool
This gets the result we are looking for. So create a mask, and index the dataframe:
mask = (df["a"]==start).cummax() ^ (df["a"]==stop).cummax() | (df["a"]==stop)
print(df[mask])
# a
#1 two
#2 three
#3 four
We can extend these findings into a function that also supports indexing up to a row or indexing from a row to the end:
def get_rows(df, col, start, stop):
    if start is None:
        mask = ~((df[col] == stop).cummax() ^ (df[col] == stop))
    else:
        mask = (df[col] == start).cummax() ^ (df[col] == stop).cummax() | (df[col] == stop)
    return df[mask]
# get rows between "two" and "four" inclusive
print(get_rows(df=df, col="a", start="two", stop="four"))
# a
#1 two
#2 three
#3 four
# get rows from "two" until the end
print(get_rows(df=df, col="a", start="two", stop=None))
# a
#1 two
#2 three
#3 four
#4 five
# get rows up to "two"
print(get_rows(df=df, col="a", start=None, stop="two"))
# a
#0 one
#1 two
Update:
For completeness, here is the indexing based solution.
def get_rows_indexing(df, col, start, stop):
    min_ind = min(df.index[df[col] == start].tolist() or [0])
    max_ind = max(df.index[df[col] == stop].tolist() or [len(df)])
    return df[min_ind:max_ind+1]
This function does essentially the same thing as the other version, but it may be easier to understand. Also this is more robust, as the other version relies on None not being a value in the desired column.
If you temporarily use column 'a' as an index, then the locate method (loc) does exactly what you are asking.
df = pd.DataFrame({'a':['one','two','three', 'four', 'five']})
start = 'two'
stop = 'four'
df = df.set_index('a').loc[start:stop].reset_index()

pandas lookup with long and nested conditions

I want to perform a lookup in a dataframe via pandas, but it would be created by a series of nested if/else statements, similar to what is outlined in Pandas dataframe add a field based on multiple if statements.
However, I want to use up to 13 different variables. This seems likely to result in chaos pretty soon. Is there some notation or other nice feature which allows me to specify such long and nested conditions in pandas?
So far np.where() http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html might be my best option.
Is there a shortcut if I only need to match for equality in all conditions?
Am I forced to write out each conditional filter? Or could I just have a single expression which chooses the (single) lookup value that is produced?
Edit: Ideally I would not want to match
df.loc[df['column_name'] == some_value]
for each value, i.e. 13 * the number of categorical levels (let's assume 7) would be a lot of different values; especially if combinations of conditions like df.loc[df['first'] == some_value][df['second'] == otherValue1] occur, i.e. they are all chained.
edit
A minimal example
df = pd.DataFrame({'ageGroup': [1, 2, 2, 1],
'first2DigitsOfPostcode': ['12', '23', '12', '12'],
'valueOfProduct': ['low', 'medum', 'high', 'low'],
'lookup_join_value': ['foo', 'bar', 'foo', 'baz']})
defines the lookup table, which was generated by a SQL query grouping by all the columns and aggregating the values (so, due to the Cartesian product, all value combinations should be represented in the lookup table).
A new record could look like
new_values = pd.DataFrame({'ageGroup': [1],
'first2DigitsOfPostcode': ['12'],
'valueOfProduct': ['low']})
How can I sort of automate the lookup of all the conditions, assuming all conditions require a match by equality (if this makes it easier)?
I found
pd.lookup (Vectorized look-up of values in Pandas dataframe), which seems to work for a single column / condition.
Maybe a merge could be a solution? (Python Pandas: DataFrame as a Lookup Table) But that does not really produce the desired lookup result.
edit 2
The second answer seems to be pretty interesting. But
mask = df.drop('lookup_join_value', axis=1).isin(new_values)
print(mask)
print(df[mask])
print(df[mask]['lookup_join_value'])
will unfortunately just return NaN for the lookup value.
Now that I better know what you're after, a dataframe merge is likely a much better choice:
IN: df.merge(new_values, how='inner')
OUT: ageGroup first2DigitsOfPostcode lookup_join_value valueOfProduct
0 1 12 foo low
1 1 12 baz low
Certainly shorter than the other answer I gave! I'll leave the old one though in case it inspires someone else.
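For reference, a self-contained sketch of this merge-based lookup using the dataframes from the question:
import pandas as pd

df = pd.DataFrame({'ageGroup': [1, 2, 2, 1],
                   'first2DigitsOfPostcode': ['12', '23', '12', '12'],
                   'valueOfProduct': ['low', 'medum', 'high', 'low'],
                   'lookup_join_value': ['foo', 'bar', 'foo', 'baz']})

new_values = pd.DataFrame({'ageGroup': [1],
                           'first2DigitsOfPostcode': ['12'],
                           'valueOfProduct': ['low']})

# an inner merge on all shared columns acts as an equality lookup across every condition at once
result = df.merge(new_values, how='inner')
# result['lookup_join_value'].tolist() -> ['foo', 'baz']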
I think df.isin() is along the lines of what you're looking for.
Using your example df, and these two:
exists = pd.DataFrame({'ageGroup': [1],
'first2DigitsOfPostcode': ['12'],
'valueOfProduct' : 'low'})
new = pd.DataFrame({'ageGroup': [1],
'first2DigitsOfPostcode': ['12'],
'valueOfProduct' : 'high'})
Then you can check to see what values match, if all, or just some:
df.isin(exists.values[0])
Out[46]: ageGroup first2DigitsOfPostcode valueOfProduct
0 True True True
1 False False False
2 False True False
3 True True True
df.isin(new.values[0])
Out[46]: ageGroup first2DigitsOfPostcode valueOfProduct
0 True True False
1 False False False
2 False True True
3 True True False
If your "query" wasn't a dataframe but instead a list it wouldn't need the ".values[0]" bit. The problem with a dictionary is it tries to match the index as well.
It's not clear to me from your question exactly what you want returned, but you could then subset based on whether all (or some) of the rows are the same:
# Returns matching rows
df[df.isin(exists.values[0]).values.all(axis=1)]
# Returns rows where the first two columns match
matches = df.isin(new.values[0]).values
df[[item==[True,True,False] for item in matches.tolist()]]
...There might be a smarter way to write that last one.
df.query is an option if you can write the query as an expression using the column names:
so you can do:
query_string = 'some long (but valid) boolean query'
df.query(query_string)
example from pandas:
>>> from numpy.random import randn
>>> from pandas import DataFrame
>>> df = DataFrame(randn(10, 2), columns=list('ab'))
>>> df.query('a > b')
# similar to this
>>> df[df.a > df.b]
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html
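If every condition is a plain equality, one way to avoid writing the filters by hand is to build the query string from the lookup record itself; a rough sketch using the new_values frame from the question:
row = new_values.iloc[0]
conditions = []
for col, val in row.items():
    # quote string values, leave numbers bare
    conditions.append(f"{col} == '{val}'" if isinstance(val, str) else f"{col} == {val}")
query_string = ' and '.join(conditions)
# e.g. "ageGroup == 1 and first2DigitsOfPostcode == '12' and valueOfProduct == 'low'"
df.query(query_string)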
