I want to perform a lookup in a dataframe via pandas. But it would be built from a series of nested if/else statements, similar to what is outlined in Pandas dataframe add a field based on multiple if statements.
However, I want to use up to 13 different variables, which seems to quickly result in chaos. Is there some notation or other feature which allows me to specify such long and nested conditions in pandas?
So far np.where() http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html might be my best option.
Is there a shortcut if I would only match for equality in all conditions?
Am I forced to write out each conditional filter? Or could I just have a single expression which chooses the (single) lookup value that is produced?
Edit: Ideally I would not want to match
df.loc[df['column_name'] == some_value]
for each value, i.e. 13 * number of categorical levels (let's assume 7) would be a lot of different values; especially if combinations of conditions like df.loc[df['first'] == some_value][df['second'] == otherValue1] occur, i.e. they are all chained.
edit
A minimal example
df = pd.DataFrame({'ageGroup': [1, 2, 2, 1],
                   'first2DigitsOfPostcode': ['12', '23', '12', '12'],
                   'valueOfProduct': ['low', 'medium', 'high', 'low'],
                   'lookup_join_value': ['foo', 'bar', 'foo', 'baz']})
defines the lookup table, which was generated by an SQL query grouping by all the columns and aggregating the values (so, due to the Cartesian product, all value combinations should be represented in the lookup table).
A new record could look like
new_values = pd.DataFrame({'ageGroup': [1],
                           'first2DigitsOfPostcode': ['12'],
                           'valueOfProduct': ['low']})
How can I automate the lookup over all the conditions, assuming all conditions require a match by equality (if this makes it easier)?
I found pd.lookup (Vectorized look-up of values in Pandas dataframe), which seems to work for a single column / condition.
Maybe a merge could be a solution (Python Pandas: DataFrame as a Lookup Table), but that does not really produce the desired lookup result.
edit 2
The second answer seems to be pretty interesting. But
mask = df.drop('lookup_join_value', axis=1).isin(new_values)
print(mask)
print(df[mask])
print(df[mask]['lookup_join_value'])
will unfortunately just return NaN for the lookup value.
Now that I better know what you're after, a dataframe merge is likely a much better choice:
IN: df.merge(new_values, how='inner')
OUT:
   ageGroup first2DigitsOfPostcode lookup_join_value valueOfProduct
0         1                     12               foo            low
1         1                     12               baz            low
Certainly shorter than the other answer I gave! I'll leave the old one though in case it inspires someone else.
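For completeness, a minimal sketch of the full merge-based lookup, reusing the example frames from the question (the printed result is just illustrative):

import pandas as pd

df = pd.DataFrame({'ageGroup': [1, 2, 2, 1],
                   'first2DigitsOfPostcode': ['12', '23', '12', '12'],
                   'valueOfProduct': ['low', 'medium', 'high', 'low'],
                   'lookup_join_value': ['foo', 'bar', 'foo', 'baz']})

new_values = pd.DataFrame({'ageGroup': [1],
                           'first2DigitsOfPostcode': ['12'],
                           'valueOfProduct': ['low']})

# An inner merge joins on all shared columns, so every one of the
# (up to 13) conditions becomes an equality match without writing
# a single explicit filter.
result = df.merge(new_values, how='inner')
print(result['lookup_join_value'].tolist())  # ['foo', 'baz']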
I think df.isin() is along the lines of what you're looking for.
Using your example df, and these two:
exists = pd.DataFrame({'ageGroup': [1],
                       'first2DigitsOfPostcode': ['12'],
                       'valueOfProduct': 'low'})

new = pd.DataFrame({'ageGroup': [1],
                    'first2DigitsOfPostcode': ['12'],
                    'valueOfProduct': 'high'})
Then you can check to see what values match, if all, or just some:
df.isin(exists.values[0])

   ageGroup  first2DigitsOfPostcode  valueOfProduct
0      True                    True            True
1     False                   False           False
2     False                    True           False
3      True                    True            True
df.isin(new.values[0])

   ageGroup  first2DigitsOfPostcode  valueOfProduct
0      True                    True           False
1     False                   False           False
2     False                    True            True
3      True                    True           False
If your "query" wasn't a dataframe but instead a list it wouldn't need the ".values[0]" bit. The problem with a dictionary is it tries to match the index as well.
It's not clear to me from your question exactly what you want returned, but you could then subset based on whether all (or some) of the rows are the same:
# Returns matching rows
df[df.isin(exists.values[0]).values.all(axis=1)]

# Returns rows where the first two columns match
matches = df.isin(new.values[0]).values
df[[item == [True, True, False] for item in matches.tolist()]]
...There might be a smarter way to write that last one.
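One possibly smarter variant (a sketch, assuming the same df and exists frames as above) is to compare column-wise against the single query row and reduce with all(axis=1), which avoids hard-coding the True/False pattern:

# Compare only the query columns against the single query row; rows
# where every column matches are full matches.
query_row = exists.iloc[0]
full_match = (df[query_row.index] == query_row).all(axis=1)
print(df[full_match])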
df.query is an option if you can write the query as an expression using the column names:
so you can do:
query_string = 'some long (but valid) boolean query'
df.query(query_string)
example from pandas:
>>> from numpy.random import randn
>>> from pandas import DataFrame
>>> df = DataFrame(randn(10, 2), columns=list('ab'))
>>> df.query('a > b')
# similar to this
>>> df[df.a > df.b]
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html
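Applied to the original lookup problem, a hedged sketch of how query could express many equality conditions in one string (df here is assumed to be the lookup table from the minimal example earlier, and the local variable names are made up for illustration; @ refers to local Python variables):

age_group = 1
postcode2 = '12'
product_value = 'low'

# One boolean expression instead of many chained .loc filters.
result = df.query("ageGroup == @age_group "
                  "and first2DigitsOfPostcode == @postcode2 "
                  "and valueOfProduct == @product_value")
print(result['lookup_join_value'])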
Related
Trying to conditionally execute a query, only when ColumnA == 'New' and ColumnB == 'Left' (in each individual row). I know that str.contains() works when I only have 1 condition; however, I'm getting a ValueError ("ValueError: The truth value of a Series is ambiguous..."). Can this approach not be successfully applied for my given scenario? Alternatively, is there a better approach?
Current code:
if df1['ColumnA'].str.contains('New') and df1['ColumnB'].str.contains('Left'):
    # do something...
Very basic example of the dataframe:
ColumnA  ColumnB
New      Left
Used     Right
Scrap    Down
New      Right
First row would be the desired row to carry forward (since it meets the criteria).
You have the right idea, however it doesn't appear like your code is expressing exactly what you want to do.
df1['ColumnA'].str.contains('New')
will return a Series with True and False values corresponding to the rows where the condition holds, not a single True or False value for whether the entire column contains 'New'. To accomplish the latter, consider doing something like the following:
'New' in df['ColumnA'].values
If you are trying to do it on a row-by-row basis, then you must use the bitwise AND operator (&) to combine truth values across Series.
This will return a boolean like you expected, hopefully this helps (:
Use bitwise & on two mask arrays and generate another column.
>>> import pandas as pd
>>> df = pd.DataFrame({'A':['New','Used','Scrap','New'], 'B':['Left','Right','Down','Right']})
>>> df
A B
0 New Left
1 Used Right
2 Scrap Down
3 New Right
>>> df['C'] = df['A'].str.contains('New') & df['B'].str.contains('Left')
>>> df
A B C
0 New Left True
1 Used Right False
2 Scrap Down False
3 New Right False
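If only the matching rows should be carried forward, as in the question, the boolean column can be used directly as a filter; a short sketch:
>>> df[df['C']]          # keep only rows where both conditions hold
     A     B     C
0  New  Left  True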
Fairly new to python. This seems to be a really simple question but I can't find any information about it.
I have a list of strings, and for each string I want to check whether it is present in a dataframe (actually in a particular column of the dataframe). Not whether a substring is present, but the whole exact string.
So my dataframe is something like the following:
A=pd.DataFrame(["ancestry","time","history"])
I should simply be able to use the "string in dataframe" method, as in
"time" in A
This returns False however.
If I run
"time" == A.iloc[1]
it returns "True", but annoyingly as part of a series, and this depends on knowing where in the dataframe the corresponding string is.
Is there some way I can just use the string in df method, to easily find out whether the strings in my list are in the dataframe?
Add .to_numpy() to the end:
'time' in A.to_numpy()
As you've noticed, the x in pandas.DataFrame syntax doesn't produce the result you want. But .to_numpy() transforms the dataframe into a Numpy array, and x in numpy.array works as you expect.
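A quick sketch of the difference using the example frame:

A = pd.DataFrame(["ancestry", "time", "history"])
print("time" in A)             # False: `in` on a DataFrame checks the column labels
print("time" in A.to_numpy())  # True:  `in` on the NumPy array checks the values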
The way to deal with this is to compare the whole dataframe with "time". That will return a mask where each value of the DF is True if it was time, False otherwise. Then, you can use .any() to check if there are any True values:
>>> A = pd.DataFrame(["ancestry","time","history"])
>>> A
0
0 ancestry
1 time
2 history
>>> A == "time" # or A.eq("time")
0
0 False
1 True
2 False
>>> (A == "time").any()
0 True
dtype: bool
Notice in the above output that (A == "time").any() returns a Series with one entry per column, indicating whether or not that column contained "time". If you want to check the entire dataframe (across all columns), call .any() twice:
>>> (A == "time").any().any()
True
I believe (myseries==mystr).any() will do what you ask. The special __contains__ method of DataFrames (which informs behavior of in) checks whether your string is a column of the DataFrame, e.g.
>>> A = pd.DataFrame({"c": [0,1,2], "d": [3,4,5]})
>>> 'c' in A
True
>>> 0 in A
False
I would slightly modify your dataframe and use .str.contains for checking where the string is present in your series.
df=pd.DataFrame()
df['A']=pd.Series(["ancestry","time","history"])
df['A'].str.contains("time")
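Since the question actually starts from a whole list of strings, here is a hedged sketch of checking them all at once with isin (my_list is a made-up example list; 0 is the default column name of the example frame):

A = pd.DataFrame(["ancestry", "time", "history"])
my_list = ["time", "geography"]

# Exact-match check of every list entry against column 0.
found = pd.Series(my_list).isin(A[0])
print(dict(zip(my_list, found)))  # {'time': True, 'geography': False}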
I am new to Pandas (but not to data science and Python). This question is not only about how to solve this specific problem but about how to handle problems like this the pandas way.
Please feel free to improve the title of this question, because I am not sure what the correct terms are here.
Here is my MWE
#!/usr/bin/env python3
import pandas as pd
data = {'A': [1, 2, 3, 3, 1, 4],
        'B': ['One', 'Two', 'Three', 'Three', 'Eins', 'Four']}
df = pd.DataFrame(data)
print(df)
Resulting in
A B
0 1 One
1 2 Two
2 3 Three
3 3 Three
4 1 Eins
5 4 Four
My assumption is that when the value in column A is 1, the value in column B is always One. And so on...
I want to prove that assumption.
Secondly, I also assume that if my first assumption is incorrect, this is not an error but there are valid (human) reasons for it, e.g. see row index 4 where the A value is related to Eins (and not One) in the B column.
Because of that I also need to see and explore the cases where my assumption is incorrect.
Update of the question:
This data is only an example. In the real world I am not aware of the pairing of the two columns. Because of that, solutions like the following do not work in my case:
df.loc[df['A'] == 1, 'B']
I do not know how many distinct values are in column A, or which ones.
I do not know how to do that with pandas. How would a pandas professional solve this?
My approach would be to use pure Python code with list(), set() and some iterations. ;)
You can filter your data frame this way:
df.loc[df['A'] == 1, 'B']
This gives you the values of B where A is 1. Next you can add an equals statement:
df.loc[df['A'] == 1, 'B'] == 'One'
Which results in a boolean series (True, False in this case). If you want to check if all are true, you add:
all(df.loc[df['A'] == 1, 'B'] == 'One')
And the answer is False because of the Eins.
EDIT
If you want to create a new column which says if your criterion is met (always the same value for B if A) then you can do this:
df['C'] = df['A'].map(df.groupby('A')['B'].nunique() < 2)
Which results in a bool column. It creates column C by mapping the values in A onto the result of the expression in brackets. Inside the brackets, a groupby on A counts the unique values in B; if that count is under 2, the mapping is unique and the map yields True.
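With the example frame from the question, that line should produce something like the following (written out by hand, so treat it as illustrative):

print(df)
#    A      B      C
# 0  1    One  False
# 1  2    Two   True
# 2  3  Three   True
# 3  3  Three   True
# 4  1   Eins  False
# 5  4   Four   True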
If the solution should test whether there is only one unique value of B per A and return all rows which fail, use DataFrameGroupBy.nunique inside GroupBy.transform to count unique values and broadcast the aggregated count back to every row of its group. You can then filter the rows where the count is not 1, meaning there are 2 or more unique values per A:
df1 = df[df.groupby('A').B.transform('nunique').ne(1)]
print (df1)
A B
0 1 One
4 1 Eins
if df1.empty:
print ('My assumption is good')
else:
print ('My assumption is wrong')
print (df1)
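To explore why the assumption fails (the second part of the question), a small follow-up sketch on df1 from above: dropping duplicate pairs shows each conflicting A/B combination exactly once.

conflicts = df1.drop_duplicates(['A', 'B']).sort_values('A')
print(conflicts)
#    A     B
# 0  1   One
# 4  1  Eins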
I am working with a Pandas dataframe similar to this:
      My_date  Something_else                                My_list
0  25/10/2019             ...               [25/10/2019, 26/10/2019]
1  03/07/2019             ...  [28/11/2017, 12/12/2017, 26/12/2017]
2  09/04/2019             ...                           [11/06/2015]
I would like to check if the value in the column called "My_date" is in the list on the same row, column "My_list". For instance here I would like to get the following output, vectorially or very efficiently:
Result
0 true
1 false
2 false
I could do this using a 'for' loop; various methods are described here, for instance. However, I am aware that iterating is rarely the best solution, all the more since my table has more than 1 million rows and many of the lists have 365 values. (But as shown above, these lists are not always date ranges.)
I know that there are many ways to do vectorial calculation on DataFrames, using .loc or .eval for instance. The point is that in my case, nothing works as expected due to these nested lists... Thus, I would like to find a vectorized solution to do that. If it matters, all my "dates" are of type pandas.Timestamp.
There are probably other questions related to similar issues, however I haven't found any appropriate answer or question using my own words.
Try:
df['Result'] = df.apply(lambda x: x['My_date'] in x['My_list'], axis=1)
df = pd.DataFrame({'My_date': ['25/10/2019', '03/07/2019', '09/04/2019'],
                   'My_list': [['25/10/2019', '26/10/2019'],
                               ['28/11/2017', '12/12/2017', '26/12/2017'],
                               ['11/06/2015']]})
df['Result'] = df.apply(lambda x: x['My_date'] in x['My_list'], axis=1)
Outputs:
My_date My_list Result
0 25/10/2019 [25/10/2019, 26/10/2019] True
1 03/07/2019 [28/11/2017, 12/12/2017, 26/12/2017] False
2 09/04/2019 [11/06/2015] False
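apply with axis=1 is still a Python-level loop per row; a hedged sketch of a slightly leaner alternative is a plain comprehension over the two columns, which is usually somewhat faster than apply while giving the same result:

# Same per-row membership test, but without the apply/axis=1 overhead.
df['Result'] = [date in dates for date, dates in zip(df['My_date'], df['My_list'])]
print(df)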
I am trying to split my dataframe into two based on medical_plan_id: if it is empty, into df1; if not empty, into df2.
df1 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] == ""]
df2 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] is not ""]
The code below works, but if there are no empty fields, my code raises TypeError("invalid type comparison").
df1 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] == ""]
How to handle such situation?
My df_with_medicalplanid looks like below:
wellthie_issuer_identifier ... medical_plan_id
0 UHC99806 ... None
1 UHC99806 ... None
Use ==, not is, to test equality
Likewise, use != instead of is not for inequality.
is has a special meaning in Python. It returns True if two variables point to the same object, while == checks if the objects referred to by the variables are equal. See also Is there a difference between == and is in Python?.
Don't repeat mask calculations
The Boolean masks you are creating are the most expensive part of your logic. It's also logic you want to avoid repeating manually as your first and second masks are inverses of each other. You can therefore use the bitwise inverse ~ ("tilde"), also accessible via operator.invert, to negate an existing mask.
Empty strings are different to null values
Equality versus empty strings can be tested via == '', but equality versus null values requires a specialized method: pd.Series.isnull. This is because null values are represented in NumPy arrays, which are used by Pandas, by np.nan, and np.nan != np.nan by design.
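A tiny sketch of the difference (assuming the usual imports):

import numpy as np
import pandas as pd

s = pd.Series(['', np.nan, '12345'])
print((s == '').tolist())   # [True, False, False] -> only catches empty strings
print(s.isnull().tolist())  # [False, True, False] -> only catches null values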
If you want to replace empty strings with null values, you can do so:
df['medical_plan_id'] = df['medical_plan_id'].replace('', np.nan)
Conceptually, it makes sense for missing values to be null (np.nan) rather than empty strings. But the opposite of the above process, i.e. converting null values to empty strings, is also possible:
df['medical_plan_id'] = df['medical_plan_id'].fillna('')
If the difference matters, you need to know your data and apply the appropriate logic.
Semi-final solution
Assuming you do indeed have null values, calculate a single Boolean mask and its inverse:
mask = df['medical_plan_id'].isnull()
df1 = df[mask]
df2 = df[~mask]
Final solution: avoid extra variables
Creating additional variables is something, as a programmer, you should look to avoid. In this case, there's no need to create two new variables, you can use GroupBy with dict to give a dictionary of dataframes with False (== 0) and True (== 1) keys corresponding to your masks:
dfs = dict(tuple(df.groupby(df['medical_plan_id'].isnull())))
Then dfs[0] represents df2 and dfs[1] represents df1 (see also this related answer). As a variant of the above, you can forgo dictionary construction and use Pandas GroupBy methods:
dfs = df.groupby(df['medical_plan_id'].isnull())
dfs.get_group(0) # equivalent to dfs[0] from dict solution
dfs.get_group(1) # equivalent to dfs[1] from dict solution
Example
Putting all the above in action:
import numpy as np
import pandas as pd

df = pd.DataFrame({'medical_plan_id': [np.nan, '', 2134, 4325, 6543, '', np.nan],
                   'values': [1, 2, 3, 4, 5, 6, 7]})
df['medical_plan_id'] = df['medical_plan_id'].replace('', np.nan)
dfs = dict(tuple(df.groupby(df['medical_plan_id'].isnull())))
print(dfs[0], dfs[1], sep='\n'*2)
medical_plan_id values
2 2134.0 3
3 4325.0 4
4 6543.0 5
medical_plan_id values
0 NaN 1
1 NaN 2
5 NaN 6
6 NaN 7
Another variant is to unpack df.groupby, which returns an iterator of tuples (the first item being the group key and the second being the corresponding dataframe).
Like this for instance:
cond = df_with_medicalplanid['medical_plan_id'] == ''
(_, df1) , (_, df2) = df_with_medicalplanid.groupby(cond)
In Python, _ is conventionally used for variables you are not interested in keeping. I have separated the code into two lines for readability.
Full example
import pandas as pd
df_with_medicalplanid = pd.DataFrame({
    'medical_plan_id': ['214212', '', '12251', '12421', ''],
    'value': 1
})
cond = df_with_medicalplanid['medical_plan_id'] == ''
(_, df1) , (_, df2) = df_with_medicalplanid.groupby(cond)
print(df1)
Returns:
medical_plan_id value
0 214212 1
2 12251 1
3 12421 1